Semi-Supervised Deep Fuzzy C-Mean Clustering for Imbalanced Multi-Class Classification

Semi-supervised learning has been successfully applied in machine learning research areas such as data mining and dynamic data analysis. Class-imbalanced learning is one of the most challenging issues in classification, and in recent years many researchers have focused on classifying multi-class imbalanced datasets. In this paper, we propose semi-supervised deep fuzzy C-mean clustering for imbalanced multi-class classification (DFCM-MC). Here, the word “deep” indicates that a decomposition strategy is applied at two levels. First, the original semi-supervised data are decomposed into supervised (labeled) and unsupervised (unlabeled) data; the model is trained on the unlabeled data together with the labeled data to extract discriminative information that is useful for classification. Second, the supervised and unsupervised data are further decomposed into multiple intra-clusters to address the multi-class imbalance problem, which tends to maximize intra-cluster classes and intra-cluster features. DFCM-MC uses these multiple intra-clusters to extract new features that control redundancy in multi-class imbalanced classification, associating the features of maximum similarity between the intra-clusters. Furthermore, to improve the classification performance of DFCM-MC, we apply a re-sampling technique to handle the imbalanced data. We conduct experiments on 18 benchmark multi-class imbalanced datasets, comparing the proposed approach with four state-of-the-art learning algorithms for multi-class imbalanced data on three performance measures (mean accuracy, mean F-measure, and mean area under the curve). The experimental results demonstrate that the proposed approach performs better, owing to its capacity to recognize and consolidate fundamental information from unsupervised data.
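
The pipeline summarized above can be illustrated with a minimal sketch. It is not the DFCM-MC algorithm itself, whose details the abstract does not give. It assumes standard fuzzy c-means fitted on the pooled labeled and unlabeled samples, uses the resulting membership degrees as additional features, and substitutes plain random over-sampling for the unspecified re-sampling technique. All function names, the fuzzifier `m = 2`, the number of clusters, and the toy data are illustrative assumptions.

```python
import numpy as np


def memberships(X, centers, m=2.0):
    """Fuzzy membership of each sample in each cluster (rows sum to 1)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)  # avoid division by zero at a center
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means (assumed here; not the authors' variant)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the samples.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        U_new = memberships(X, centers, m)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers


def random_oversample(X, y, seed=0):
    """Duplicate minority samples until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        extra = rng.choice(members, size=counts.max() - n, replace=True)
        idx.append(np.concatenate([members, extra]))
    idx = np.concatenate(idx)
    return X[idx], y[idx]


# Toy imbalanced three-class data (hypothetical, for illustration only).
rng = np.random.default_rng(42)
X_lab = np.vstack([
    rng.normal(0.0, 0.5, (40, 2)),
    rng.normal(3.0, 0.5, (8, 2)),
    rng.normal([0.0, 4.0], 0.5, (4, 2)),
])
y_lab = np.array([0] * 40 + [1] * 8 + [2] * 4)
X_unl = np.vstack([
    rng.normal(0.0, 0.5, (60, 2)),
    rng.normal(3.0, 0.5, (20, 2)),
    rng.normal([0.0, 4.0], 0.5, (20, 2)),
])

# Level 1: labeled and unlabeled data are kept apart but clustered jointly.
centers = fuzzy_c_means(np.vstack([X_lab, X_unl]), n_clusters=3)

# Level 2: cluster memberships become new features for the labeled samples.
U = memberships(X_lab, centers)
X_aug = np.hstack([X_lab, U])

# Re-sampling step: balance the classes before training any classifier.
X_bal, y_bal = random_oversample(X_aug, y_lab)
```

Any standard classifier can then be trained on `X_bal` and `y_bal`. The unlabeled samples influence the result only through the cluster centers, which is one simple way to let unsupervised data shape the feature space.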
