Improving bug localization with word embedding and enhanced convolutional neural networks

Abstract Context: Automatic localization of buggy files can speed up the process of bug fixing to improve the efficiency and productivity of software quality assurance teams. Useful semantic information is available in bug reports and source code, but it is usually underutilized by existing bug localization approaches. Objective: To improve the performance of bug localization, we propose DeepLoc, a novel deep learning-based model that makes full use of semantic information. Method: DeepLoc is composed of an enhanced convolutional neural network (CNN) that considers bug-fixing recency and frequency, together with word-embedding and feature-detecting techniques. DeepLoc uses word embeddings to represent the words in bug reports and source files that retain their semantic information, and different CNNs to detect features from them. DeepLoc is evaluated on over 18,500 bug reports extracted from AspectJ, Eclipse, JDT, SWT, and Tomcat projects. Results: The experimental results show that DeepLoc achieves 10.87%–13.4% higher MAP (mean average precision) than conventional CNN. DeepLoc outperforms four current state-of-the-art approaches (DeepLocator, HyLoc, LR+WE, and BugLocator) in terms of Accuracy@k (the percentage of bug reports for which at least one real buggy file is located within the top k rank), MAP, and MRR (mean reciprocal rank) using less computation time. Conclusion: DeepLoc is capable of automatically connecting bug reports to the corresponding buggy files and achieves better performance than four state-of-the-art approaches based on a deep understanding of semantics in bug reports and source code.

[1]  David W. Binkley,et al.  To CamelCase or under_score , 2019, 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC).

[2]  Claes Wohlin,et al.  Experimentation in Software Engineering , 2012, Springer Berlin Heidelberg.

[3]  Anh Tuan Nguyen,et al.  Combining Deep Learning with Information Retrieval to Localize Buggy Files for Bug Reports (N) , 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[4]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[5]  Martin White,et al.  Toward Deep Learning Software Repositories , 2015, 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories.

[6]  Yan Xiao,et al.  Machine translation-based bug localization technique for bridging lexical gap , 2018, Inf. Softw. Technol..

[7]  Ákos Horváth,et al.  EMF-IncQuery: An integrated development environment for live model queries , 2015, Sci. Comput. Program..

[8]  Jian Zhou,et al.  Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[9]  Yan Xiao,et al.  Improving Bug Localization with an Enhanced Convolutional Neural Network , 2017, 2017 24th Asia-Pacific Software Engineering Conference (APSEC).

[10]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[11]  Tao Wang,et al.  Convolutional Neural Networks over Tree Structures for Programming Language Processing , 2014, AAAI.

[12]  Zhi Jin,et al.  Building Program Vector Representations for Deep Learning , 2014, KSEM.

[13]  Ray Denenberg,et al.  Report from the Joint W3C/IETF URI Planning Interest Group: Uniform Resource Identifiers (URIs), URLs, and Uniform Resource Names (URNs): Clarifications and Recommendations , 2002, RFC.

[14]  Razvan C. Bunescu,et al.  Learning to rank relevant files for bug reports using domain knowledge , 2014, SIGSOFT FSE.

[15]  W. Bruce Croft,et al.  Search Engines - Information Retrieval in Practice , 2009 .

[16]  Sarfraz Khurshid,et al.  Improving bug localization using structured information retrieval , 2013, 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[17]  Tim Menzies,et al.  A Deep Learning Model for Estimating Story Points , 2016, IEEE Transactions on Software Engineering.

[18]  Premkumar T. Devanbu,et al.  How, and why, process metrics are better , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[19]  Anh Tuan Nguyen,et al.  Graph-Based Statistical Language Model for Code , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[20]  Hinrich Schütze,et al.  Introduction to Information Retrieval: Scoring, term weighting, and the vector space model , 2008 .

[21]  Zhendong Su,et al.  On the naturalness of software , 2012, ICSE 2012.

[22]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[23]  Mary Jean Harrold,et al.  Empirical evaluation of the tarantula automatic fault-localization technique , 2005, ASE.

[24]  Hung Viet Nguyen,et al.  A topic-based approach for narrowing the search space of buggy files from a bug report , 2011, 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011).

[25]  Chao Liu,et al.  Statistical Debugging: A Hypothesis Testing-Based Approach , 2006, IEEE Transactions on Software Engineering.

[26]  Peter Zoeteweij,et al.  A practical evaluation of spectrum-based fault localization , 2009, J. Syst. Softw..

[27]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[28]  Lu Zhang,et al.  Boosting Bug-Report-Oriented Fault Localization with Segmentation and Stack-Trace Analysis , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[29]  Xiaodong Gu,et al.  Deep API learning , 2016, SIGSOFT FSE.

[30]  H. B. Mann,et al.  On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other , 1947 .

[31]  Witold Pedrycz,et al.  A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[32]  Thomas D. LaToza,et al.  Hard-to-answer questions about code , 2010, PLATEAU '10.

[33]  Xiao Ma,et al.  From Word Embeddings to Document Similarities for Improved Information Retrieval in Software Engineering , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[34]  Sai Zhang,et al.  Software bug localization with markov logic , 2014, ICSE Companion.

[35]  Andreas Zeller,et al.  Where Should We Fix This Bug? A Two-Phase Recommendation Model , 2013, IEEE Transactions on Software Engineering.

[36]  Alexander M. Rush,et al.  Character-Aware Neural Language Models , 2015, AAAI.

[37]  Laurens van der Maaten,et al.  Accelerating t-SNE using tree-based algorithms , 2014, J. Mach. Learn. Res..

[38]  D. Hubel,et al.  Receptive fields, binocular interaction and functional architecture in the cat's visual cortex , 1962, The Journal of physiology.

[39]  Kunihiko Fukushima,et al.  Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Visual Pattern Recognition , 1982 .

[40]  Bela Gipp,et al.  TF-IDuF : A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections , 2017 .

[41]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[42]  Geoffrey E. Hinton,et al.  Deep Learning , 2015, Nature.

[43]  Bogdan Kwolek,et al.  Face Detection Using Convolutional Neural Networks and Gabor Filters , 2005, ICANN.

[44]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[45]  Song Wang,et al.  Automatically Learning Semantic Features for Defect Prediction , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).