Cross-project Defect Prediction via ASTToken2Vec and BLSTM-based Neural Network

Cross-project defect prediction (CPDP), a means of focusing the quality assurance effort of software projects, has been under heavy investigation in recent years. In this paper, we propose a novel CPDP approach based on deep learning. In particular, we model each program module as a simplified abstract syntax tree (S-AST). For each node in the S-AST, only the project-independent node type is retained, and other project-specific information (such as variable and method names) is ignored, so that the modeling method is project-independent and suitable for CPDP. We then extract token sequences from the program modules modeled as S-ASTs. In addition, to construct meaningful vector representations for the token sequences, we propose a novel unsupervised embedding method, ASTToken2Vec, which learns semantic information from the natural structure of the S-AST. Finally, we use a BLSTM (bi-directional long short-term memory) based neural network to automatically learn semantic features from the vectorized token sequences and construct CPDP models. In our empirical studies, we chose 10 real large-scale open-source Java projects as subjects. The results show that our proposed CPDP approach performs significantly better than 5 state-of-the-art CPDP baselines in terms of AUC.
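The idea of keeping only node types can be illustrated with a minimal sketch. It is not the paper's implementation: it uses Python's standard `ast` module as a stand-in for a Java parser, the pre-order traversal is an assumption, and the function name `node_type_tokens` is hypothetical.

```python
import ast
from typing import List


def node_type_tokens(source: str) -> List[str]:
    """Return the pre-order sequence of AST node-type names for `source`.

    Identifiers and literal values are attributes of the nodes, not nodes
    themselves, so they never appear in the output. This mirrors the idea
    of dropping project-specific names (assumed traversal order: pre-order).
    """
    tokens: List[str] = []

    def visit(node: ast.AST) -> None:
        tokens.append(type(node).__name__)  # keep only the node type
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


# Two functions that differ only in their identifiers.
seq_a = node_type_tokens("def f(a):\n    return a + 1")
seq_b = node_type_tokens("def g(x):\n    return x + 1")
print(seq_a)
print(seq_a == seq_b)
```

Because the two snippets differ only in function and variable names, they yield the same token sequence, which is the property that makes such a representation transferable between projects.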
