A contextual approach towards more accurate duplicate bug report detection and ranking

Bug-tracking and issue-tracking systems tend to be populated with bugs, issues, or tickets written by a wide variety of bug reporters, with different levels of training and knowledge about the system being discussed. Many bug reporters lack the skills, vocabulary, knowledge, or time to efficiently search the issue tracker for similar issues. As a result, issue trackers are often full of duplicate issues and bugs, and bug triaging is time consuming and error prone. Many researchers have approached the bug-deduplication problem using off-the-shelf information-retrieval tools, such as BM25F used by Sun et al. In our work, we extend the state of the art by investigating how contextual information, relying on our prior knowledge of software quality, software architecture, and system-development (LDA) topics, can be exploited to improve bug-deduplication. We demonstrate the effectiveness of our contextual bug-deduplication method on the bug repository of the Android ecosystem. Based on this experience, we conclude that researchers should not ignore the context of software engineering when using IR tools for deduplication.

[1]  David W. Binkley,et al.  Maintenance and Evolution: Information Retrieval Applications , 2010, Encyclopedia of Software Engineering.

[2]  Ian H. Witten,et al.  WEKA: a machine learning workbench , 1994, Proceedings of ANZIIS '94 - Australian New Zealnd Intelligent Information Systems Conference.

[3]  Siau-Cheng Khoo,et al.  Towards more accurate retrieval of duplicate bug reports , 2011, 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011).

[4]  Thomas Zimmermann,et al.  Towards the next generation of bug tracking systems , 2008, 2008 IEEE Symposium on Visual Languages and Human-Centric Computing.

[5]  Stephen E. Robertson,et al.  Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval , 1994, SIGIR '94.

[6]  J. Pfanzagl Parametric Statistical Theory , 1994 .

[7]  Naohiro Ishii,et al.  Analysis of software bug causes and its prevention , 1999, Inf. Softw. Technol..

[8]  Ashish Sureka,et al.  Detecting Duplicate Bug Report Using Character N-Gram-Based Features , 2010, 2010 Asia Pacific Software Engineering Conference.

[9]  William Pugh,et al.  The Google FindBugs fixit , 2010, ISSTA '10.

[10]  Stéphane Ducasse,et al.  Semantic clustering: Identifying topics in source code , 2007, Inf. Softw. Technol..

[11]  Ellen M. Voorhees,et al.  The TREC-8 Question Answering Track Report , 1999, TREC.

[12]  Bernhard Ganter,et al.  Formal Concept Analysis , 2013 .

[13]  Nicholas Jalbert,et al.  Automated duplicate detection for bug tracking systems , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[14]  Thomas Zimmermann,et al.  What Makes a Good Bug Report? , 2008, IEEE Transactions on Software Engineering.

[15]  Tao Xie,et al.  An approach to detecting duplicate bug reports using natural language and execution information , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[16]  Andrian Marcus,et al.  An information retrieval approach to concept location in source code , 2004, 11th Working Conference on Reverse Engineering.

[17]  Andrian Marcus,et al.  Supporting program comprehension using semantic and structural information , 2001, Proceedings of the 23rd International Conference on Software Engineering. ICSE 2001.

[18]  Sudheendra Hangal,et al.  Tracking down software bugs using automatic anomaly detection , 2002, ICSE '02.

[19]  Lyndon Hiew,et al.  Assisted Detection of Duplicate Bug Reports , 2006 .

[20]  Michael W. Godfrey,et al.  Automated topic naming to support cross-project analysis of software maintenance activities , 2011, MSR '11.

[21]  Pradeep Singh,et al.  Weight similarity measurement model based, object oriented approach for bug databases mining to detect similar and duplicate bugs , 2009, ICAC3 '09.

[22]  Krzysztof Czarnecki,et al.  Modelling the 'Hurried' bug report reading process to summarize bug reports , 2012, ICSM.

[23]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[24]  Trevor Darrell,et al.  Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing) , 2006 .

[25]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[26]  David Lo,et al.  Duplicate bug report detection with a combination of information retrieval and topic modeling , 2012, 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering.

[27]  Eleni Stroulia,et al.  Do the stars align? Multidimensional analysis of Android's layered architecture , 2012, 2012 9th IEEE Working Conference on Mining Software Repositories (MSR).

[28]  Ramesh Nallapati,et al.  Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora , 2009, EMNLP.

[29]  Thomas Zimmermann,et al.  Duplicate bug reports considered harmful … really? , 2008, 2008 IEEE International Conference on Software Maintenance.

[30]  Nicolás Serrano,et al.  Bugzilla, ITracker, and Other Bug Trackers , 2005, IEEE Softw..

[31]  Gail C. Murphy,et al.  Who should fix this bug? , 2006, ICSE.

[32]  Lerina Aversano,et al.  Learning from bug-introducing changes to prevent fault prone code , 2007, IWPSE '07.

[33]  Jeff Sutherland,et al.  Business objects in corporate information systems , 1995, CSUR.

[34]  Siau-Cheng Khoo,et al.  A discriminative model approach for accurate duplicate bug report retrieval , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[35]  Tibor Gyimóthy,et al.  Using information retrieval based coupling measures for impact analysis , 2009, Empirical Software Engineering.

[36]  Per Runeson,et al.  Detection of Duplicate Defect Reports Using Natural Language Processing , 2007, 29th International Conference on Software Engineering (ICSE'07).

[37]  Santosh S. Vempala,et al.  Latent Semantic Indexing , 2000, PODS 2000.

[38]  Gail C. Murphy,et al.  Coping with an open bug repository , 2005, eclipse '05.

[39]  Santonu Sarkar,et al.  Mining business topics in source code using latent dirichlet allocation , 2008, ISEC '08.

[40]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[41]  D. Lindley Regression and Correlation Analysis , 1990 .

[42]  Eleni Stroulia,et al.  Understanding Android Fragmentation with Topic Analysis of Vendor-Specific Bugs , 2012, 2012 19th Working Conference on Reverse Engineering.

[43]  Denys Poshyvanyk,et al.  Combining Formal Concept Analysis with Information Retrieval for Concept Location in Source Code , 2007, 15th IEEE International Conference on Program Comprehension (ICPC '07).

[44]  Mehran Sahami,et al.  Text Mining: Classification, Clustering, and Applications , 2009 .

[45]  Wei Zhao,et al.  SNIAFL: towards a static non-interactive approach to feature location , 2004, Proceedings. 26th International Conference on Software Engineering.

[46]  Stephen E. Robertson,et al.  Simple BM25 extension to multiple weighted fields , 2004, CIKM '04.