CDBH: A clustering and density-based hybrid approach for imbalanced data classification

Abstract The problem of imbalanced data set classification is prevalent in machine learning and data mining research. In such data sets, the classes contain unequal numbers of samples, so that one class (the majority or negative class) has far more samples than the other (the minority or positive class). Classical classifiers are ineffective under these conditions because they are biased toward the majority class and neglect the minority class, which is the more important one. Preprocessing the data distribution before training the classifier is one of the most effective ways to address this problem. Preprocessing methods balance the data distribution by decreasing the majority class size (under-sampling methods), increasing the minority class size (over-sampling methods), or combining both (hybrid methods). In this paper, we propose a simple and effective hybrid approach based on clustering and the concept of density, called Clustering and Density-Based Hybrid (CDBH). First, the minority class samples are clustered with the well-known k-means algorithm, and the density of each sample within its cluster is computed. Denser minority samples are then more likely to be selected to generate new minority samples. To decrease the majority class size, k-means is applied again, this time to the majority class samples, and their densities are computed as in the previous stage. Finally, denser majority samples have a higher chance of being retained in the training set, and the remaining samples are removed to balance the sample distribution between the classes. In the experiments, a Support Vector Machine (SVM) is used as the classifier, and the F-measure and AUC criteria are employed for evaluation. The preprocessing methods are also compared in terms of the complexity of the classification model and the over-sampling rate. A comparison of CDBH with other state-of-the-art methods over 44 imbalanced data sets shows the superiority of the proposed method in terms of the F-measure criterion.
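
The four stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the density measure, the generation rule for new minority samples, or the target class sizes, so the code below assumes (1) a density proxy equal to the inverse of the distance to the cluster centroid, (2) SMOTE-style interpolation between a selected sample and a random member of the same cluster, and (3) a target size halfway between the two original class sizes. All function names are hypothetical.

```python
import math
import random


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on a list of tuples; returns (labels, centers)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign(points, centers), centers


def density_weights(points, labels, centers):
    """ASSUMED density proxy: closer to the cluster centroid means denser."""
    return [1.0 / (1.0 + math.dist(p, centers[l])) for p, l in zip(points, labels)]


def oversample_minority(minority, n_new, k, rng):
    """Create n_new synthetic samples, favouring denser minority samples."""
    labels, centers = kmeans(minority, k, rng)
    weights = density_weights(minority, labels, centers)
    synthetic = []
    for _ in range(n_new):
        i = rng.choices(range(len(minority)), weights=weights)[0]
        mates = [p for p, l in zip(minority, labels) if l == labels[i]]
        mate = rng.choice(mates)
        t = rng.random()
        # ASSUMED generation rule: interpolate toward a same-cluster sample.
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(minority[i], mate)))
    return synthetic


def undersample_majority(majority, n_keep, k, rng):
    """Keep n_keep majority samples, favouring denser ones."""
    labels, centers = kmeans(majority, k, rng)
    weights = density_weights(majority, labels, centers)
    # Weighted sampling without replacement: key = u ** (1 / w), keep largest keys.
    keys = [rng.random() ** (1.0 / w) for w in weights]
    order = sorted(range(len(majority)), key=lambda i: keys[i], reverse=True)
    return [majority[i] for i in order[:n_keep]]


def cdbh_like_resample(minority, majority, k=2, seed=0):
    """Hybrid resampling sketch: both classes end up with the same size."""
    rng = random.Random(seed)
    target = (len(minority) + len(majority)) // 2  # ASSUMED balancing target
    new_min = minority + oversample_minority(minority, target - len(minority), k, rng)
    new_maj = undersample_majority(majority, target, k, rng)
    return new_min, new_maj
```

Calling `cdbh_like_resample(minority, majority)` with two lists of equal-length numeric tuples returns the original minority samples followed by the synthetic ones, together with the retained subset of the majority class. The result could then be passed to any classifier, such as the SVM used in the paper's experiments.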
