Combining Text Mining and Data Mining for Bug Report Classification

Misclassification of bug reports inevitably sacrifices the performance of bug prediction models. Manual examinations can help reduce the noise but bring a heavy burden for developers instead. In this paper, we propose a hybrid approach by combining both text mining and data mining techniques of bug report data to automate the prediction process. The first stage leverages text mining techniques to analyze the summary parts of bug reports and classifies them into three levels of probability. The extracted features and some other structured features of bug reports are then fed into the machine learner in the second stage. Data grafting techniques are employed to bridge the two stages. Comparative experiments with previous studies on the same data -- three large-scale open source projects -- consistently achieve a reasonable enhancement (from 77.4% to 81.7%, 73.9% to 80.2% and 87.4% to 93.7%, respectively) over their best results in terms of overall performance. Additional comparative empirical experiments on other two popular open source repositories confirm the findings and demonstrate the benefits of our approach.

[1]  Andreas Zeller,et al.  Predicting faults from cached history , 2008, ISEC '08.

[2]  Oscar Nierstrasz,et al.  Assigning bug reports using a vocabulary-based expertise model of developers , 2009, 2009 6th IEEE International Working Conference on Mining Software Repositories.

[3]  Ken-ichi Matsumoto,et al.  Classifying Bug Reports to Bugs and Other Requests Using Topic Modeling , 2013, 2013 20th Asia-Pacific Software Engineering Conference (APSEC).

[4]  David Lo,et al.  DRONE: Predicting Priority of Reported Bugs by Multi-factor Analysis , 2013, ICSM.

[5]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[6]  K. K. Chaturvedi,et al.  Determining Bug severity using machine learning techniques , 2012, 2012 CSI Sixth International Conference on Software Engineering (CONSEG).

[7]  Andreas Zeller,et al.  It's not a bug, it's a feature: How misclassification impacts bug prediction , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[8]  Sunghun Kim,et al.  Reducing Features to Improve Code Change-Based Bug Prediction , 2013, IEEE Transactions on Software Engineering.

[9]  Foutse Khomh,et al.  Is it a bug or an enhancement?: a text-based approach to classify change requests , 2008, CASCON '08.

[10]  Thomas Zimmermann,et al.  Duplicate bug reports considered harmful … really? , 2008, 2008 IEEE International Conference on Software Maintenance.

[11]  Harald C. Gall,et al.  Predicting the fix time of bugs , 2010, RSSE '10.

[12]  Thomas Zimmermann,et al.  What Makes a Good Bug Report? , 2008, IEEE Transactions on Software Engineering.

[13]  Miryung Kim,et al.  Validity concerns in software engineering research , 2010, FoSER '10.

[14]  Brad A. Myers,et al.  A Linguistic Analysis of How People Describe Software Problems , 2006, Visual Languages and Human-Centric Computing (VL/HCC'06).

[15]  Jiri Matas,et al.  On Combining Classifiers , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[16]  Andreas Zeller,et al.  How Long Will It Take to Fix This Bug? , 2007, Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007).

[17]  Andreas Hotho,et al.  A Brief Survey of Text Mining , 2005, LDV Forum.

[18]  Per Runeson,et al.  Guidelines for conducting and reporting case study research in software engineering , 2009, Empirical Software Engineering.

[19]  Bart Goethals,et al.  Predicting the severity of a reported bug , 2010, 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010).

[20]  Siau-Cheng Khoo,et al.  Towards more accurate retrieval of duplicate bug reports , 2011, 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011).

[21]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[22]  อนิรุธ สืบสิงห์,et al.  Data Mining Practical Machine Learning Tools and Techniques , 2014 .

[23]  Harald C. Gall,et al.  Analysing Software Repositories to Understand Software Evolution , 2008, Software Evolution.

[24]  David Lo,et al.  Tag recommendation in software information sites , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[25]  Gail C. Murphy,et al.  Automatic bug triage using text categorization , 2004, SEKE.

[26]  Gail C. Murphy,et al.  Who should fix this bug? , 2006, ICSE.

[27]  Michele Lanza,et al.  An extensive comparison of bug prediction approaches , 2010, 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010).

[28]  Rongxin Wu,et al.  Dealing with noise in defect prediction , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[29]  Tao Xie,et al.  An approach to detecting duplicate bug reports using natural language and execution information , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[30]  David Lo,et al.  Information Retrieval Based Nearest Neighbor Classification for Fine-Grained Bug Severity Prediction , 2012, 2012 19th Working Conference on Reverse Engineering.

[31]  Yu Zhou,et al.  A Bayesian Network Based Approach for Change Coupling Prediction , 2008, 2008 15th Working Conference on Reverse Engineering.

[32]  He Jiang,et al.  Developer prioritization in bug repositories , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[33]  Siau-Cheng Khoo,et al.  A discriminative model approach for accurate duplicate bug report retrieval , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[34]  David Lo,et al.  Automatic classification of software related microblogs , 2012, 2012 28th IEEE International Conference on Software Maintenance (ICSM).

[35]  David Lo,et al.  It's not a bug, it's a feature: does misclassification affect bug localization? , 2014, MSR 2014.

[36]  Per Runeson,et al.  Detection of Duplicate Defect Reports Using Natural Language Processing , 2007, 29th International Conference on Software Engineering (ICSE'07).

[37]  Thomas Zimmermann,et al.  Extracting structural information from bug reports , 2008, MSR '08.

[38]  Elaine J. Weyuker,et al.  Predicting the location and number of faults in large software systems , 2005, IEEE Transactions on Software Engineering.

[39]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[40]  Ken-ichi Matsumoto,et al.  Predicting Re-opened Bugs: A Case Study on the Eclipse Project , 2010, 2010 17th Working Conference on Reverse Engineering.

[41]  Mladen A. Vouk,et al.  On predicting the time taken to correct bug reports in open source projects , 2009, 2009 IEEE International Conference on Software Maintenance.

[42]  Tim Menzies,et al.  Automated severity assessment of software defect reports , 2008, 2008 IEEE International Conference on Software Maintenance.

[43]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[44]  Yiming Yang,et al.  An Evaluation of Statistical Approaches to Text Categorization , 1999, Information Retrieval.