A Fault Prediction Model with Limited Fault Data to Improve Test Process

Software fault prediction models are used to identify the fault-prone software modules and produce reliable software. Performance of a software fault prediction model is correlated with available software metrics and fault data. In some occasions, there may be few software modules having fault data and therefore, prediction models using only labeled data can not provide accurate results. Semi-supervised learning approaches which benefit from unlabeled and labeled data may be applied in this case. In this paper, we propose an artificial immune system based semi-supervised learning approach. Proposed approach uses a recent semi-supervised algorithm called YATSI (Yet Another Two Stage Idea) and in the first stage of YATSI, AIRS (Artificial Immune Recognition Systems) is applied. In addition, AIRS, RF (Random Forests) classifier, AIRS based YATSI, and RF based YATSI are benchmarked. Experimental results showed that while YATSI algorithm improved the performance of AIRS, it diminished the performance of RF for unbalanced datasets. Furthermore, performance of AIRS based YATSI is comparable with RF which is the best machine learning classifier according to some researches.

[1]  Khaled El Emam,et al.  Comparing case-based reasoning classifiers for predicting high risk software components , 2001, J. Syst. Softw..

[2]  A. B. Watkins,et al.  A resource limited artificial immune classifier , 2002, Proceedings of the 2002 Congress on Evolutionary Computation. CEC'02 (Cat. No.02TH8600).

[3]  Taghi M. Khoshgoftaar,et al.  Tree-based software quality estimation models for fault prediction , 2002, Proceedings Eighth IEEE Symposium on Software Metrics.

[4]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[5]  Kurt Driessens,et al.  Using Weighted Nearest Neighbor to Benefit from Unlabeled Data , 2006, PAKDD.

[6]  Leandro Nunes de Castro,et al.  The Clonal Selection Algorithm with Engineering Applications 1 , 2000 .

[7]  Xin Jin,et al.  Random Forest and PCA for Self-Organizing Maps based Automatic Music Genre Discrimination , 2006, DMIN.

[8]  Tim Menzies,et al.  The \{PROMISE\} Repository of Software Engineering Databases. , 2005 .

[9]  Bojan Cukic,et al.  Robust prediction of fault-proneness by random forests , 2004, 15th International Symposium on Software Reliability Engineering.

[10]  Bojan Cukic,et al.  A Statistical Framework for the Prediction of Fault-Proneness , 2007 .

[11]  Alexander Zien,et al.  Semi-Supervised Learning , 2006 .

[12]  David Elworthy,et al.  Does Baum-Welch Re-estimation Help Taggers? , 1994, ANLP.

[13]  Chris F. Kemerer,et al.  A Metrics Suite for Object Oriented Design , 2015, IEEE Trans. Software Eng..

[14]  Ian Gilchrist,et al.  Software Quality Engineering: Testing, Quality Assurance, and Quantifiable Improvement. By Jeff Tian. Published jointly by John Wiley & Sons, Inc., Hoboken, NJ, U.S.A. and IEEE Computer Society Press, Los Alamitos, CA, U.S.A., 2005. ISBN: 0-471-71345-7, pp 412: Book Reviews , 2006 .

[15]  T. Huang Performance Comparisons of Semi-Supervised Learning Algorithms , 2005 .

[16]  Taghi M. Khoshgoftaar,et al.  Software Quality Classification Modeling Using the SPRINT Decision Tree Algorithm , 2003, Int. J. Artif. Intell. Tools.

[17]  Douglas C. Schmidt,et al.  Ultra-Large-Scale Systems: The Software Challenge of the Future , 2006 .

[18]  Avrim Blum,et al.  Learning from Labeled and Unlabeled Data using Graph Mincuts , 2001, ICML.

[19]  H. J. Scudder,et al.  Probability of error of some adaptive pattern-recognition machines , 1965, IEEE Trans. Inf. Theory.

[20]  Fabio Gagliardi Cozman,et al.  Unlabeled Data Can Degrade Classification Performance of Generative Classifiers , 2002, FLAIRS.

[21]  Jason Brownlee,et al.  Artificial immune recognition system (AIRS): a review and analysis , 2005 .

[22]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[23]  Rayid Ghani,et al.  Analyzing the effectiveness and applicability of co-training , 2000, CIKM '00.

[24]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[25]  ASHOK K. AGRAWALA,et al.  Learning with a probabilistic teacher , 1970, IEEE Trans. Inf. Theory.

[26]  Yan Zhou,et al.  Enhancing Supervised Learning with Unlabeled Data , 2000, ICML.

[27]  Banu Diri,et al.  Software Fault Prediction with Object-Oriented Metrics Based Artificial Immune Recognition System , 2007, PROFES.

[28]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[29]  Charles X. Ling,et al.  AUC: A Better Measure than Accuracy in Comparing Learning Algorithms , 2003, Canadian Conference on AI.

[30]  Taghi M. Khoshgoftaar,et al.  Unsupervised learning for expert-based software quality estimation , 2004, Eighth IEEE International Symposium on High Assurance Systems Engineering, 2004. Proceedings..

[31]  Taghi M. Khoshgoftaar,et al.  An application of fuzzy clustering to software quality prediction , 2000, Proceedings 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology.

[32]  Taghi M. Khoshgoftaar,et al.  Semi-supervised learning for software quality estimation , 2004, 16th IEEE International Conference on Tools with Artificial Intelligence.

[33]  Edward B. Allen,et al.  GP-based software quality prediction , 1998 .

[34]  Fabio Gagliardi Cozman,et al.  Semi-Supervised Learning of Mixture Models , 2003, ICML.

[35]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007, IEEE Transactions on Software Engineering.

[36]  Ayhan Demiriz,et al.  Semi-Supervised Support Vector Machines , 1998, NIPS.

[37]  David Yarowsky,et al.  Unsupervised Word Sense Disambiguation Rivaling Supervised Methods , 1995, ACL.

[38]  David J. Miller,et al.  A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data , 1996, NIPS.

[39]  David A. Landgrebe,et al.  The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon , 1994, IEEE Trans. Geosci. Remote. Sens..

[40]  Tong-Seng Quah,et al.  Application of neural networks for software quality prediction using object-oriented metrics , 2005, J. Syst. Softw..

[41]  Bojan Cukic,et al.  Predicting fault prone modules by the Dempster-Shafer belief networks , 2003, 18th IEEE International Conference on Automated Software Engineering, 2003. Proceedings..

[42]  Stanley C. Fralick,et al.  Learning to recognize patterns without a teacher , 1967, IEEE Trans. Inf. Theory.

[43]  Banu Diri,et al.  Software defect prediction using artificial immune recognition system , 2007 .

[44]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[45]  Shumeet Baluja,et al.  Probabilistic Modeling for Face Orientation Discrimination: Learning from Labeled and Unlabeled Data , 1998, NIPS.

[46]  Ian Witten,et al.  Data Mining , 2000 .

[47]  Zoubin Ghahramani,et al.  Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions , 2003, ICML 2003.

[48]  Mark Neal,et al.  Investigating the evolution and stability of a resource limited artificial immune system. , 2000 .

[49]  Shun-ichi Amari,et al.  A Theory of Pattern Recognition , 1968 .