A Novel Technique for Duplicate Detection and Classification of Bug Reports

As software products grow more complex, finding and correcting bugs in large programs becomes more difficult. Developers rely on bug reports to fix bugs, and bug-tracking tools allow them to upload, manage, and comment on these reports to guide corrective software maintenance. However, duplicate bug reports are submitted very frequently, so the triagers who help developers eliminate bugs must spend considerable time and effort identifying and analyzing them. In addition, classifying bug reports helps triagers route each bug to the fixers who have the most experience resolving historical bugs in the same category. Unfortunately, because a large number of bug reports are submitted every day, classifying them manually adds to the triagers' workload. To address these problems, we develop a novel technique for automatic duplicate detection and classification of bug reports, which reduces the time and effort triagers spend on bug fixing. The technique uses a support vector machine to check whether a new bug report is a duplicate, and it uses a concept profile to classify bug reports into related categories in a taxonomic tree. Finally, we conduct experiments on bug reports extracted from the large-scale open source project Mozilla that demonstrate the feasibility of the proposed approach.

key words: bug report classification, concept profile, duplicate detection, support vector machine, software maintenance
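The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pair features (cosine and Jaccard similarity), the subgradient-trained linear SVM, the representation of a concept profile as a per-category term-frequency vector, the category paths, and the sample reports are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a report and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def pair_features(r1, r2):
    """Hypothetical features for a report pair: [cosine, Jaccard]."""
    t1, t2 = tokenize(r1), tokenize(r2)
    s1, s2 = set(t1), set(t2)
    union = s1 | s2
    jaccard = len(s1 & s2) / len(union) if union else 0.0
    return [cosine(Counter(t1), Counter(t2)), jaccard]

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM fitted by subgradient descent on the regularized hinge loss.
    Labels are +1 (duplicate pair) and -1 (non-duplicate pair)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The L2 penalty always shrinks w; the hinge term acts only inside the margin.
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b

def is_duplicate(new_report, known_reports, model):
    """Flag the new report if the SVM scores any (new, known) pair as positive."""
    w, b = model
    return any(
        sum(wi * xi for wi, xi in zip(w, pair_features(new_report, old))) + b > 0
        for old in known_reports
    )

def build_profiles(labeled_reports):
    """Assumed concept profile: summed term frequencies of each category's reports."""
    profiles = {}
    for category, text in labeled_reports:
        profiles.setdefault(category, Counter()).update(tokenize(text))
    return profiles

def classify(report, profiles):
    """Assign the category whose profile is most similar to the report."""
    terms = Counter(tokenize(report))
    return max(profiles, key=lambda c: cosine(terms, profiles[c]))

# Made-up training pairs: (report A, report B, label).
TRAIN_PAIRS = [
    ("Firefox crashes when opening a PDF file", "Crash when opening PDF file in Firefox", 1),
    ("Bookmarks toolbar disappears after restart", "Bookmarks toolbar is missing after restart", 1),
    ("Firefox crashes when opening a PDF file", "Bookmarks toolbar disappears after restart", -1),
    ("Cannot print page from menu", "Bookmarks toolbar is missing after restart", -1),
]
MODEL = train_linear_svm(
    [pair_features(a, b) for a, b, _ in TRAIN_PAIRS],
    [label for _, _, label in TRAIN_PAIRS],
)
KNOWN = ["Firefox crashes when opening a PDF file", "Bookmarks toolbar disappears after restart"]
# Hypothetical taxonomy nodes, written as paths from the root of the tree.
PROFILES = build_profiles([
    ("Browser/PDF viewer", "Firefox crashes when opening a PDF file"),
    ("Browser/Bookmarks", "Bookmarks toolbar disappears after restart"),
])

if __name__ == "__main__":
    report = "Firefox crashes when opening PDF file"
    print(is_duplicate(report, KNOWN, MODEL), classify(report, PROFILES))
```

In this sketch a new report is first checked against the known reports and, if it is not flagged as a duplicate, it would then be assigned to the closest category profile. A real system would need far more training pairs and richer features than the two used here.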
