Preventing duplicate bug reports by continuously querying bug reports

Bug deduplication, or duplicate bug report detection, is a hot topic in software engineering information retrieval research, but it is often not deployed in practice. To de-duplicate bug reports, developers typically rely on the search capabilities of the bug tracking software they employ, such as Bugzilla, Jira, or GitHub Issues. These search capabilities range from simple SQL string search to the IR-based word indexing methods employed by search engines. Yet too often these searches do very little to stop the creation of duplicate bug reports: some bug trackers have more than 10% of their bug reports marked as duplicates. Perhaps these bug tracker search engines are not enough? In this paper we propose a method that attempts to prevent duplicate bug reports before they start: continuous querying. That is, as the bug reporter types in their bug report, their text is used to query the bug database to find duplicate or related bug reports. Continuously querying bug reports allows the reporter to be alerted to duplicates as they report the bug, rather than having to formulate queries to find them. This work thus introduces a new kind of bug deduplication task, as well as a new way of evaluating bug report deduplication techniques. We show that simple IR measures can address this problem, but also that further research is needed to refine this novel process, which can be integrated into modern bug report systems.
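The idea of continuous querying can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: it assumes a small in-memory index, TF-IDF weighting with cosine similarity as the "simple IR measure", and a query re-issued every few words of the draft. The names `BugIndex`, `continuous_query`, and the `step` parameter are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BugIndex:
    """Hypothetical in-memory TF-IDF index over existing bug reports."""

    def __init__(self, reports):
        # reports: mapping of bug id -> report text (illustrative data only)
        counts = {bug_id: Counter(tokenize(text)) for bug_id, text in reports.items()}
        n_docs = len(counts)
        doc_freq = Counter()
        for term_counts in counts.values():
            doc_freq.update(term_counts.keys())
        # Smoothed inverse document frequency for every indexed term.
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
                    for t, df in doc_freq.items()}
        self.vectors = {bug_id: self._weigh(c) for bug_id, c in counts.items()}

    def _weigh(self, term_counts):
        """Build a unit-length TF-IDF vector; terms unknown to the index are dropped."""
        vec = {t: tf * self.idf[t] for t, tf in term_counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def query(self, text, k=3):
        """Return up to k (bug id, cosine similarity) pairs, best match first."""
        q = self._weigh(Counter(tokenize(text)))
        scored = []
        for bug_id, vec in self.vectors.items():
            score = sum(w * vec.get(t, 0.0) for t, w in q.items())
            if score > 0:
                scored.append((score, bug_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(bug_id, round(score, 3)) for score, bug_id in scored[:k]]


def continuous_query(index, draft, step=3, k=3):
    """Simulate typing: re-query the index after every `step` words of the draft."""
    words = draft.split()
    for end in range(step, len(words) + step, step):
        prefix = " ".join(words[:end])
        yield prefix, index.query(prefix, k)


if __name__ == "__main__":
    # Made-up reports, used only to exercise the sketch.
    index = BugIndex({
        101: "Application crashes when saving file to network drive",
        102: "Dark theme makes toolbar icons invisible",
        103: "Crash on save when file is on a shared network folder",
    })
    draft = "the app crashes when I save a file on the network drive"
    for prefix, hits in continuous_query(index, draft):
        print(repr(prefix), "->", hits)
```

In a real tracker the query would presumably be triggered by edit events in the report form and served by the tracker's own search back end; the word-count trigger here only stands in for that behaviour.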
