Semi-Supervised Deep Fuzzy C-Mean Clustering for Software Fault Prediction

Software fault prediction is an important research area in software quality assurance. In this paper, we propose a semi-supervised deep fuzzy C-mean (DFCM) clustering approach for software fault prediction, which combines semi-supervised DFCM clustering with feature compression techniques. The deep architecture is used to form feature-based multi-clusters of the unlabeled and labeled data sets along with their class labels. When training the model, we handle unlabeled and labeled data simultaneously, so that hidden information in the unlabeled data can be exploited alongside the labeled data to support the construction of an accurate model. We use DFCM clustering to handle the class imbalance problem; in addition, fuzzy logic is close to human reasoning and easy to understand. We further improve prediction performance by combining feature learning techniques, namely feature extraction and feature selection (using random under-sampling), to generate good features and remove irrelevant and redundant ones, which reduces the noise in the data used for classification. The results show that the combination of deep multi-clusters and feature techniques works better because of its ability to identify and combine the essential information in the data features. The classification model predicts on the basis of maximum homogeneity between the features of labeled and unlabeled data, and it is trained on the de-noised data set obtained by the deep combination of multi-clusters and feature techniques. To evaluate the effectiveness of our approach, we chose data sets from real-world software projects (NASA and Eclipse), compared our approach with a number of recent and classical baseline methods, and measured performance using probability of detection, F-measure, and area under the curve.
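To make the clustering component concrete, the following is a minimal sketch of standard (unsupervised, single-layer) fuzzy C-means in Python with NumPy. It is not the DFCM method proposed here: the deep, semi-supervised, and feature-compression parts are omitted. The function name `fuzzy_c_means`, the fuzzifier `m = 2`, the stopping tolerance, and the synthetic two-cluster data are all assumptions made for illustration.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=2, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Standard fuzzy C-means (illustrative sketch, not the paper's DFCM).

    Returns (centers, U), where U[i, j] is the membership of sample i
    in cluster j and every row of U sums to 1.
    """
    rng = np.random.default_rng(seed)
    # Random initial memberships, normalized so each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)

    def update_centers(U):
        # Centers are means of the samples weighted by membership^m.
        Um = U ** m
        return (Um.T @ X) / Um.sum(axis=0)[:, None]

    for _ in range(n_iter):
        centers = update_centers(U)
        # Euclidean distance from every sample to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # avoid division by zero
        # Membership update: u_ij proportional to d_ij^(-2 / (m - 1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return update_centers(U), U

# Hypothetical data: two well-separated groups standing in for module metrics.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
centers, U = fuzzy_c_means(X, n_clusters=2)
```

Unlike hard clustering, each sample keeps a graded membership in every cluster, which is the property fuzzy approaches rely on when class boundaries are uncertain.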
