DeepCPDP: Deep Learning Based Cross-Project Defect Prediction

Cross-project defect prediction (CPDP) is an active research topic in software defect prediction because it applies when the target project is new or does not have enough labeled modules. Most previous work uses labeled datasets gathered from other projects (i.e., the source projects) and proposes transfer learning based methods to reduce the data distribution difference between projects. In this article, we propose DeepCPDP, a deep learning based CPDP method. We represent the source code of each extracted program module as a simplified abstract syntax tree (SimAST). For each node of a SimAST, we keep only its node type, which is project-independent, and ignore method and variable names, which are project-specific. SimAST is therefore project-independent and especially suitable for CPDP. We then extract a token vector from each module after it is modeled via SimAST. Moreover, we design a new unsupervised embedding method, SimASTToken2Vec, to learn meaningful representations of these extracted token vectors. Next, we employ a bi-directional long short-term memory (BiLSTM) neural network to automatically learn semantic features from the embedded token vectors. In addition, we apply an attention mechanism over the BiLSTM layer to learn weights for the vectors of the learned semantic features. Finally, we construct CPDP models with a logistic regression classifier. To evaluate DeepCPDP, we use ten large-scale projects from different application domains and measure the prediction performance of the trained models with AUC. According to the Scott-Knott test, DeepCPDP significantly outperforms eight state-of-the-art baselines. Moreover, we verify that the use of SimASTToken2Vec, BiLSTM, and the attention mechanism is competitive in our proposed method.
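
The idea of keeping only node types can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes Python's standard ast module as a stand-in parser, and the function name simast_tokens and the choice to skip expression-context nodes are hypothetical, made only for illustration. It shows that two modules differing only in method and variable names yield the same token sequence.

```python
import ast
from typing import List

# Assumption for this sketch: expression-context markers (Load/Store/Del)
# carry no structural information, so they are skipped.
_SKIPPED = (ast.expr_context,)


def simast_tokens(source: str) -> List[str]:
    """Return the node-type names of the parsed source in pre-order.

    Identifier names (functions, variables) are never recorded, so the
    resulting token sequence does not depend on project-specific naming.
    """
    tree = ast.parse(source)
    tokens: List[str] = []

    def visit(node: ast.AST) -> None:
        # Record only the node type, never the node's name or value.
        if not isinstance(node, _SKIPPED):
            tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return tokens


if __name__ == "__main__":
    module_a = "def add(x, y):\n    return x + y\n"
    module_b = "def plus(left, right):\n    return left + right\n"
    # Same structure, different identifiers: the token sequences match.
    print(simast_tokens(module_a) == simast_tokens(module_b))
```

In a pipeline like the one described above, such token sequences would then be mapped to integer indices and passed to an embedding layer; that step is omitted here.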
