Automated Configuration Bug Report Prediction Using Text Mining

Configuration bugs are one of the dominant causes of software failures. Previous studies show that a configuration bug can cause huge financial losses in a software system. The importance of configuration bugs has attracted various research studies, e.g., on detecting, diagnosing, and fixing configuration bugs. Given a bug report, an approach that can identify whether the bug is a configuration bug could help developers reduce debugging effort. We refer to this problem as configuration bug report prediction. To address this problem, we develop a new automated framework that applies text mining techniques to the natural-language descriptions of bug reports. The framework trains a statistical model on historical bug reports with known labels (i.e., configuration or non-configuration), and the model is then used to predict a label for a new bug report. Developers could apply our model to automatically predict the labels of bug reports and thereby improve their productivity. Our tool first applies feature selection techniques (e.g., information gain and chi-square) to pre-process the textual information in bug reports, and then applies various text mining techniques (e.g., naive Bayes, SVM, and naive Bayes multinomial) to build statistical models. We evaluate our solution on five bug report datasets from Accumulo, ActiveMQ, Camel, Flume, and Wicket. We show that naive Bayes multinomial with information gain achieves the best performance. On average across the five projects, its accuracy, configuration F-measure, and non-configuration F-measure are 0.811, 0.450, and 0.880, respectively. We also compare our solution with the method proposed by Arshad et al. The results show that our approach, which uses naive Bayes multinomial with information gain, improves on the accuracy, configuration F-measure, and non-configuration F-measure of Arshad et al.'s method by 8.34%, 103.7%, and 4.24% on average, respectively.
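The pipeline described above (feature selection by information gain, a naive Bayes multinomial classifier, and per-class F-measure) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the bug reports are invented toy examples, and the choice of keeping the top 5 terms is an arbitrary assumption made only for the demonstration.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    """Gain of the binary feature 'term occurs in the report' w.r.t. the label."""
    classes = sorted(set(labels))
    base = entropy([labels.count(c) for c in classes])
    cond = 0.0
    for present in (True, False):
        subset = [y for d, y in zip(docs, labels) if (term in d) == present]
        if subset:
            cond += len(subset) / len(labels) * entropy([subset.count(c) for c in classes])
    return base - cond

def select_features(docs, labels, k):
    """Keep the k terms with the highest information gain (ties: alphabetical)."""
    vocab = sorted({t for d in docs for t in d})
    return set(sorted(vocab, key=lambda t: -information_gain(docs, labels, t))[:k])

def train_nbm(docs, labels, features):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(t for t in d if t in features)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(features)
        loglik[c] = {t: math.log((counts[c][t] + 1) / total) for t in features}
    return prior, loglik

def predict(model, doc):
    """Return the class with the highest log posterior; unseen terms are ignored."""
    prior, loglik = model
    return max(prior, key=lambda c: prior[c] + sum(loglik[c][t] for t in doc if t in loglik[c]))

def f_measure(gold, pred, cls):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented toy training data (assumption: not taken from the paper's datasets).
train = [
    ("Server fails to start when the port property is missing in config file", "config"),
    ("Wrong default value for timeout setting in xml config", "config"),
    ("Classpath property ignored after changing config setting", "config"),
    ("NullPointerException in parser when input is empty", "non-config"),
    ("Race condition causes deadlock in worker thread", "non-config"),
    ("Memory leak when closing connection in loop", "non-config"),
    ("Off by one error in buffer index calculation", "non-config"),
]
docs = [tokenize(text) for text, _ in train]
labels = [label for _, label in train]

features = select_features(docs, labels, k=5)  # k=5 is an arbitrary choice
model = train_nbm(docs, labels, features)

new_report = tokenize("Broker ignores the timeout property setting in the config file")
print(predict(model, new_report))
```

Reporting an F-measure per class, as the abstract does, matters because configuration reports are the minority class: a model can reach high accuracy and a high non-configuration F-measure while still missing most configuration bugs.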

[1]  Yuanyuan Zhou,et al.  Do not blame users for misconfigurations , 2013, SOSP.

[2]  Saurabh Bagchi,et al.  Characterizing configuration problems in Java EE application servers: An empirical study with GlassFish and JBoss , 2013, 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE).

[3]  Tim Menzies,et al.  Automated severity assessment of software defect reports , 2008, 2008 IEEE International Conference on Software Maintenance.

[4]  Ken-ichi Matsumoto,et al.  Predicting Re-opened Bugs: A Case Study on the Eclipse Project , 2010, 2010 17th Working Conference on Reverse Engineering.

[5]  Sinno Jialin Pan,et al.  Transfer defect learning , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[6]  Michael Kaminsky,et al.  Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles , 2013, SOSP 2013.

[7]  Anh Tuan Nguyen,et al.  Multi-layered approach for recovering links between bug reports and fixes , 2012, SIGSOFT FSE.

[8]  Ferdian Thung,et al.  Automatic Defect Categorization , 2012, 2012 19th Working Conference on Reverse Engineering.

[9]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[10]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SIGKDD Explorations.

[11]  Michael D. Ernst,et al.  Automated diagnosis of software configuration errors , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[12]  Rongxin Wu,et al.  ReLink: recovering links between bugs and changes , 2011, ESEC/FSE '11.

[13]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[14]  David Lo,et al.  Automated library recommendation , 2013, 2013 20th Working Conference on Reverse Engineering (WCRE).

[15]  Nir Friedman,et al.  Probabilistic Graphical Models - Principles and Techniques , 2009 .

[16]  David Lo,et al.  An Empirical Study of Bugs in Software Build Systems , 2013, 2013 13th International Conference on Quality Software.

[17]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[18]  Tao Xie,et al.  Identifying security bug reports via text mining: An industrial case study , 2010, 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010).

[19]  David Lo,et al.  Automatic Defect Categorization Based on Fault Triggering Conditions , 2014, 2014 19th International Conference on Engineering of Complex Computer Systems.

[20]  Xiao Ma,et al.  An empirical study on configuration errors in commercial and open source systems , 2011, SOSP.

[21]  Charu C. Aggarwal,et al.  Mining Text Data , 2012, Springer US.

[22]  Archana Ganapathi,et al.  Why Do Internet Services Fail, and What Can Be Done About It? , 2002, USENIX Symposium on Internet Technologies and Systems.

[23]  Philip S. Yu,et al.  Top 10 algorithms in data mining , 2007, Knowledge and Information Systems.

[24]  Mona Attariyan,et al.  Automating Configuration Troubleshooting with Dynamic Information Flow Analysis , 2010, OSDI.

[25]  Mona Attariyan,et al.  Using Causality to Diagnose Configuration Bugs , 2008, USENIX Annual Technical Conference.

[26]  Luiz André Barroso,et al.  The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines , 2009 .

[27]  Yang Feng,et al.  Towards more accurate multi-label software behavior learning , 2014, 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE).

[28]  Ahmed E. Hassan,et al.  Security versus performance bugs: a case study on Firefox , 2011, MSR '11.

[29]  David Lo,et al.  Identifying Linux bug fixing patches , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[30]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[31]  David Lo,et al.  A Comparative Study of Supervised Learning Algorithms for Re-opened Bug Prediction , 2013, CSMR 2013.

[32]  David Lo,et al.  Tag recommendation in software information sites , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[33]  Helen J. Wang,et al.  Automatic Misconfiguration Troubleshooting with PeerPressure , 2004, OSDI.

[34]  Yi Zhang,et al.  Classifying Software Changes: Clean or Buggy? , 2008, IEEE Transactions on Software Engineering.

[35]  David Lo,et al.  Will Fault Localization Work for These Failures? An Automated Approach to Predict Effectiveness of Fault Localization Tools , 2013, 2013 IEEE International Conference on Software Maintenance.