An Analytics-Driven Approach to Identify Duplicate Bug Records in Large Data Repositories

Identifying and analyzing duplicate bug records for a software application is typically a mundane task carried out by software maintenance engineers. As the bug repository of a large application grows, this manual process becomes error-prone and time-consuming. Automatic detection of duplicate bug records reduces the manual effort spent by maintenance engineers and, in turn, the cost of software maintenance. There are two types of duplicate bug records: (1) records that describe the same problem using similar vocabulary, and (2) records that describe different problems using dissimilar vocabulary but share the same underlying root cause. Each type requires a different set of techniques to identify. In this chapter, we explain the machine learning techniques used to detect both types of duplicate bug records. Some duplicate bug records recur, that is, they keep showing up over a long period of time. We present a framework that automates the entire process of detecting both types of duplicates as well as recurring bug records. Using the framework, we conducted empirical studies on the open-source Chrome bug records that are accessible online, and we report the results.
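
For the first type of duplicate, where two records describe the same problem in similar vocabulary, one common way to compare records is cosine similarity over TF-IDF term vectors. The sketch below is only an illustration under that assumption: the abstract does not say which techniques the chapter uses, and the function names, the tokenizer, the smoothed IDF formula, and the similarity threshold are all hypothetical choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per record."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of records in which each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed variant) so that no term gets zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_duplicates(docs, threshold=0.5):
    """Return (i, j, score) for every pair of records scoring >= threshold.

    The threshold is a hypothetical tuning parameter, not a value from the chapter.
    """
    vecs = tfidf_vectors(docs)
    pairs = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            score = cosine(vecs[i], vecs[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs
```

Calling `candidate_duplicates(records, threshold)` on a list of bug summaries returns the index pairs worth reviewing as possible duplicates. Lexical similarity of this kind cannot find the second type of duplicate, since those records by definition share little vocabulary. The all-pairs loop is quadratic in the number of records, so a large repository would need an index or a nearest-neighbour search instead.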
