Compositional Vector Space Models for Improved Bug Localization

Software developers and maintainers often need to locate code units responsible for a particular bug. A number of Information Retrieval (IR) techniques have been proposed to map natural language bug descriptions to the associated code units. The vector space model (VSM) with the standard tf-idf weighting scheme (VSMnatural), has been shown to outperform nine other state-of-the-art IR techniques. However, there are multiple VSM variants with different weighting schemes, and their relative performance differs for different software systems. Based on this observation, we propose to compose various VSM variants, modelling their composition as an optimization problem. We propose a genetic algorithm (GA) based approach to explore the space of possible compositions and output a heuristically near-optimal composite model. We have evaluated our approach against several baselines on thousands of bug reports from AspectJ, Eclipse, and SWT. On average, our approach (VSM composite) improves hit at 5 (Hit@5), mean average precision (MAP), and mean reciprocal rank (MRR) scores of VSMnatural by 18.4%, 20.6%, and 10.5% respectively. We also integrate our compositional model with AmaLgam, which is a state-of-art bug localization technique. The resultant model named AmaLgamcomposite on average can improve Hit@5, MAP, and MRR scores of AmaLgam by 8.0%, 14.4% and 6.5% respectively.

[1]  F. Wilcoxon Individual Comparisons by Ranking Methods , 1945 .

[2]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .

[3]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[4]  C. D. Gelatt,et al.  Optimization by Simulated Annealing , 1983, Science.

[5]  Thomas A. Standish An Essay on Software Reuse , 1984, IEEE Transactions on Software Engineering.

[6]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .

[7]  Ted J. Biggerstaff,et al.  The concept assignment problem in program understanding , 1993, [1993] Proceedings Working Conference on Reverse Engineering.

[8]  Thomas Bäck,et al.  Evolutionary algorithms in theory and practice - evolution strategies, evolutionary programming, genetic algorithms , 1996 .

[9]  Emden R. Gansner,et al.  Bunch: a clustering tool for the recovery and maintenance of software system structures , 1999, Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM'99). 'Software Maintenance for Business Change' (Cat. No.99CB36360).

[10]  Giuliano Antoniol,et al.  Recovering Traceability Links between Code and Documentation , 2002, IEEE Trans. Software Eng..

[11]  G. Engels,et al.  HANDBOOK OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING , 2002 .

[12]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[13]  Rainer Koschke,et al.  Locating Features in Source Code , 2003, IEEE Trans. Software Eng..

[14]  Mayur Naik,et al.  From symptom to cause: localizing errors in counterexample traces , 2003, POPL '03.

[15]  Andrian Marcus,et al.  Recovering documentation-to-source-code traceability links using latent semantic indexing , 2003, 25th International Conference on Software Engineering, 2003. Proceedings..

[16]  M. Harman,et al.  A robust search-based approach to project management in the presence of abandonment, rework, error and uncertainty , 2004, 10th International Symposium on Software Metrics, 2004. Proceedings..

[17]  Mark Harman,et al.  10 th International Software Metrics Symposium (METRICS 2004) , 2004 .

[18]  Paolo Tonella,et al.  Evolutionary testing of classes , 2004, ISSTA '04.

[19]  Tao Tao,et al.  A formal study of information retrieval heuristics , 2004, SIGIR '04.

[20]  Mary Jean Harrold,et al.  Empirical evaluation of the tarantula automatic fault-localization technique , 2005, ASE.

[21]  Michael Rovatsos,et al.  Handbook of Software Engineering and Knowledge Engineering , 2005 .

[22]  Yann-Gaël Guéhéneuc,et al.  Feature identification: a novel approach and a case study , 2005, 21st IEEE International Conference on Software Maintenance (ICSM'05).

[23]  Andrew David Eisenberg,et al.  Dynamic feature traces: finding features in unfamiliar code , 2005, 21st IEEE International Conference on Software Maintenance (ICSM'05).

[24]  Kiarash Mahdavi,et al.  Allowing Overlapping Boundaries in Source Code using a Search Based Approach to Concept Binding , 2006, 2006 22nd IEEE International Conference on Software Maintenance.

[25]  Yann-Gaël Guéhéneuc,et al.  Feature Location Using Probabilistic Ranking of Methods Based on Execution Scenarios and Information Retrieval , 2007, IEEE Transactions on Software Engineering.

[26]  Mark Harman,et al.  Search Algorithms for Regression Test Case Prioritization , 2007, IEEE Transactions on Software Engineering.

[27]  A.J.C. van Gemund,et al.  On the Accuracy of Spectrum-based Fault Localization , 2007, Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION (TAICPART-MUTATION 2007).

[28]  Alfred V. Aho,et al.  CERBERUS: Tracing Requirements to Source Code Using Information Retrieval, Dynamic Analysis, and Program Analysis , 2008, 2008 16th IEEE International Conference on Program Comprehension.

[29]  David Lo,et al.  Comprehensive evaluation of association measures for fault localization , 2010, 2010 IEEE International Conference on Software Maintenance.

[30]  Letha H. Etzkorn,et al.  Bug localization using latent Dirichlet allocation , 2010, Inf. Softw. Technol..

[31]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[32]  Siau-Cheng Khoo,et al.  A discriminative model approach for accurate duplicate bug report retrieval , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[33]  Bogdan Dit,et al.  Integrating information retrieval, execution and link analysis algorithms to improve feature location in software , 2012, Empirical Software Engineering.

[34]  Arie van Deursen,et al.  A Controlled Experiment for Program Comprehension through Trace Visualization , 2011, IEEE Transactions on Software Engineering.

[35]  David Lo,et al.  Search-based fault localization , 2011, 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011).

[36]  Zhenchang Xing,et al.  Concern Localization using Information Retrieval: An Empirical Study on Linux Kernel , 2011, 2011 18th Working Conference on Reverse Engineering.

[37]  Avinash C. Kak,et al.  Retrieval from software libraries for bug localization: a comparative study of generic and composite text models , 2011, MSR '11.

[38]  David Lo,et al.  Interactive fault localization leveraging simple user feedback , 2012, 2012 28th IEEE International Conference on Software Maintenance (ICSM).

[39]  Jian Zhou,et al.  Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[40]  Avinash C. Kak,et al.  Incorporating version histories in Information Retrieval based bug localization , 2012, 2012 9th IEEE Working Conference on Mining Software Repositories (MSR).

[41]  Sarfraz Khurshid,et al.  Improving bug localization using structured information retrieval , 2013, 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[42]  David Lo,et al.  Empirical Evaluation of Bug Linking , 2013, 2013 17th European Conference on Software Maintenance and Reengineering.

[43]  Bogdan Dit,et al.  Feature location in source code: a taxonomy and survey , 2013, J. Softw. Evol. Process..

[44]  Andrea De Lucia,et al.  How to effectively use topic models for software engineering tasks? An approach based on Genetic Algorithms , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[45]  David Lo,et al.  Fusion fault localizers , 2014, ASE.

[46]  David Lo,et al.  Extended comprehensive study of association measures for fault localization , 2014, J. Softw. Evol. Process..

[47]  David Lo,et al.  Version history, similar report, and structure: putting them together for improved bug localization , 2014, ICPC 2014.

[48]  W. Eric Wong,et al.  The DStar Method for Effective Software Fault Localization , 2014, IEEE Transactions on Reliability.