Seml: A Semantic LSTM Model for Software Defect Prediction

Software defect prediction can help developers find potential bugs and reduce maintenance costs. Traditional approaches usually use software metrics (Lines of Code, Cyclomatic Complexity, etc.) as features to build classifiers that identify defective software modules. However, software metrics often fail to capture the syntactic and semantic information of programs. In this paper, we propose Seml, a novel framework that combines word embedding and deep learning methods for defect prediction. Specifically, for each program source file, we first extract a token sequence from its abstract syntax tree. Then, we map each token in the sequence to a real-valued vector using a mapping table, which is trained with an unsupervised word embedding model. Finally, we use the vector sequences and their labels (defective or non-defective) to build a Long Short-Term Memory (LSTM) network. The LSTM model can automatically learn the semantic information of programs and perform defect prediction. Evaluation results on eight open source projects show that Seml outperforms three state-of-the-art defect prediction approaches on most of the datasets for both within-project and cross-project defect prediction.
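The first two steps of the pipeline described above (token extraction from an abstract syntax tree, then lookup in a token-to-vector mapping table) can be illustrated with a minimal sketch. It is not the Seml implementation: Python's standard `ast` module stands in for whatever parser the authors used, the choice of which node types to keep is an assumption, and the mapping table is filled with random vectors rather than trained with a word embedding model. The names `extract_tokens`, `build_table`, and `embed` are hypothetical.

```python
import ast
import random
from typing import Dict, List

# Node types kept as plain tokens (an assumed selection, not the paper's list).
CONTROL_NODES = (ast.If, ast.For, ast.While, ast.Return, ast.Try)


def extract_tokens(source: str) -> List[str]:
    """Walk the AST depth-first and emit one token per selected node."""
    tokens: List[str] = []

    def visit(node: ast.AST) -> None:
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            tokens.append(f"{type(node).__name__}:{node.name}")
        elif isinstance(node, ast.Call):
            func = node.func
            # Use the identifier for plain calls, the attribute name for method calls.
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", "?")
            tokens.append(f"Call:{name}")
        elif isinstance(node, CONTROL_NODES):
            tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


def build_table(sequences: List[List[str]], dim: int = 4, seed: int = 0) -> Dict[str, List[float]]:
    """Stand-in mapping table: random vectors, NOT a trained embedding."""
    rng = random.Random(seed)
    vocab = sorted({tok for seq in sequences for tok in seq})
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in vocab}


def embed(tokens: List[str], table: Dict[str, List[float]], dim: int = 4) -> List[List[float]]:
    """Map each token to its vector; unknown tokens get a zero vector."""
    unknown = [0.0] * dim
    return [table.get(tok, unknown) for tok in tokens]


SAMPLE = '''
def safe_div(a, b):
    if b == 0:
        return 0
    return a / b

def main():
    for x in range(3):
        print(safe_div(x, 2))
'''

tokens = extract_tokens(SAMPLE)
table = build_table([tokens])
vectors = embed(tokens, table)
print(tokens)
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

In a full pipeline, the resulting vector sequence and the file's defective or non-defective label would be fed to an LSTM classifier; that stage is omitted here.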
