A Systematic Review of Unsupervised Learning Techniques for Software Defect Prediction

Background: Unsupervised machine learners have been increasingly applied to software defect prediction. This approach may be valuable for software practitioners because it reduces the need for labeled training data. Objective: Investigate the use and performance of unsupervised learning techniques in software defect prediction. Method: We conducted a systematic literature review of work published between January 2000 and March 2018, which identified 49 studies that satisfied our inclusion criteria and together contained 2456 individual experimental results. To compare prediction performance across these studies in a consistent way, we (re-)computed the confusion matrices and employed the Matthews Correlation Coefficient (MCC) as our main performance measure. Results: Our meta-analysis shows that unsupervised models are comparable with supervised models for both within-project and cross-project prediction. Among the 14 families of unsupervised models, Fuzzy C-Means (FCM) and Fuzzy Self-Organizing Maps (FSOMs) perform best. In addition, where we were able to check, we found that almost 11% (262/2456) of published results (contained in 16 papers) were internally inconsistent, and a further 33% (823/2456) provided insufficient detail for us to check. Conclusion: Although many factors, e.g., dataset characteristics, affect the performance of a classifier, broadly speaking, unsupervised classifiers do not seem to perform worse than the supervised classifiers in our review. However, we note a worrying prevalence of (i) demonstrably erroneous experimental results, (ii) undemanding benchmarks and (iii) incomplete reporting. We therefore encourage researchers to be comprehensive in their reporting.
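As an illustration of the performance measure named above, the following minimal sketch computes MCC from the four cells of a binary confusion matrix using the standard definition. The function name `mcc`, the argument order, and the convention of returning 0.0 when the denominator is zero are assumptions made for this example; they are not taken from the review itself.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives
    and true negatives (hypothetical argument names for this sketch).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: if any marginal total is zero the coefficient
    # is undefined, so report 0.0 (no better than chance).
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    print(round(mcc(tp=6, fp=2, fn=1, tn=3), 3))
```

MCC ranges from -1 (total disagreement) through 0 (chance-level prediction) to +1 (perfect prediction), and it uses all four cells of the matrix, which is why it can be recomputed for any study that reports enough information to reconstruct its confusion matrix.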
