SCUT: Multi-class imbalanced data classification using SMOTE and cluster-based undersampling

Class imbalance is a crucial problem in machine learning and occurs in many domains. Specifically, the two-class problem has received considerable interest from researchers in recent years, leading to solutions for oil spill detection, tumour discovery and fraudulent credit card detection, among others. However, handling class imbalance in datasets that contain multiple classes, with varying degrees of imbalance, has received limited attention. In such a multi-class imbalanced dataset, the classification model tends to favour the majority classes and incorrectly classify instances from the minority classes as belonging to the majority classes, leading to poor predictive accuracy. Further, there is a need to handle not only the imbalance between classes but also the selection of examples within a class (the so-called within-class imbalance). In this paper, we propose the SCUT hybrid sampling method, which balances the number of training examples in such a multi-class setting. Our SCUT approach oversamples minority classes through the generation of synthetic examples and employs cluster analysis to undersample majority classes; it thereby addresses both within-class and between-class imbalance. Our experimental results on a number of multi-class problems show that, when the SCUT method is used to pre-process the data before classification, we obtain highly accurate models that compare favourably to the state of the art.
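The abstract's description of SCUT can be sketched in code: synthetic (SMOTE-style) oversampling of classes below a target size, and cluster-based undersampling of classes above it, drawing examples evenly across clusters to preserve within-class structure. The sketch below is a minimal illustration, not the authors' implementation; it assumes the mean class size as the per-class target, a 3-component Gaussian mixture (EM) for the undersampling step, and standard SMOTE interpolation with up to 5 neighbours. The function name `scut_resample` and all parameter choices are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors


def scut_resample(X, y, n_clusters=3, k_neighbors=5, random_state=0):
    """Resample every class to the mean class size (assumed target).

    Minority classes are grown with SMOTE-style interpolation between a
    point and one of its nearest same-class neighbours; majority classes
    are shrunk by fitting an EM Gaussian mixture and drawing examples
    evenly from its clusters (within-class balance).
    Assumes each minority class has at least 2 examples.
    """
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    m = int(round(counts.mean()))  # per-class target size

    X_parts, y_parts = [], []
    for c in classes:
        Xc = X[y == c]
        n = len(Xc)
        if n < m:
            # SMOTE-style oversampling: new point lies on the segment
            # between a random example and one of its k nearest neighbours.
            k = min(k_neighbors, n - 1)
            nn = NearestNeighbors(n_neighbors=k + 1).fit(Xc)
            _, idx = nn.kneighbors(Xc)  # column 0 is the point itself
            synth = []
            while len(synth) < m - n:
                i = rng.integers(n)
                j = idx[i][rng.integers(1, k + 1)]
                synth.append(Xc[i] + rng.random() * (Xc[j] - Xc[i]))
            Xc = np.vstack([Xc, np.array(synth)])
        elif n > m:
            # Cluster-based undersampling: cluster with EM, then take
            # examples round-robin from the clusters until m are kept.
            gm = GaussianMixture(n_components=min(n_clusters, n),
                                 random_state=random_state).fit(Xc)
            labels = gm.predict(Xc)
            pools = [list(np.flatnonzero(labels == j))
                     for j in range(gm.n_components)]
            for pool in pools:
                rng.shuffle(pool)
            keep = []
            while len(keep) < m:
                for pool in pools:
                    if pool and len(keep) < m:
                        keep.append(pool.pop())
            Xc = Xc[keep]
        X_parts.append(Xc)
        y_parts.append(np.full(len(Xc), c))
    return np.vstack(X_parts), np.concatenate(y_parts)
```

After resampling, every class contains the same number of examples, so a classifier trained on the output no longer sees a skewed class distribution.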
