Explaining software defects using topic models

Researchers have proposed various metrics based on measurable aspects of the source code entities (e.g., methods, classes, files, or modules) and the social structure of a software project in an effort to explain the relationships between software development and software defects. However, these metrics largely ignore the actual functionality, i.e., the conceptual concerns, of a software system, which are the main technical concepts that reflect the business logic or domain of the system. For instance, while lines of code may be a good general measure for defects, a large entity responsible for simple I/O tasks is likely to have fewer defects than a small entity responsible for complicated compiler implementation details. In this paper, we study the effect of conceptual concerns on code quality. We use a statistical topic modeling technique to approximate software concerns as topics; we then propose various metrics on these topics to help explain the defect-proneness (i.e., quality) of the entities. Paramount to our proposed metrics is that they take into account the defect history of each topic. Case studies on multiple versions of Mozilla Firefox, Eclipse, and Mylyn show that (i) some topics are much more defect-prone than others, (ii) defect-prone topics tend to remain so over time, and (iii) defect-prone topics provide additional explanatory power for code quality over existing structural and historical metrics.

[1]  Tibor Gyimóthy,et al.  Modeling class cohesion as mixtures of latent topics , 2009, 2009 IEEE International Conference on Software Maintenance.

[2]  I. Jolliffe Principal Component Analysis , 2002 .

[3]  Richard N. Taylor,et al.  Software traceability with topic modeling , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[4]  Tsutomu Ishida,et al.  Metrics and Models in Software Quality Engineering , 1995 .

[5]  Padmanabhan Santhanam,et al.  Exploring defect data from development and customer usage on software modules over multiple releases , 1998, Proceedings Ninth International Symposium on Software Reliability Engineering (Cat. No.98TB100257).

[6]  Denys Poshyvanyk,et al.  Using Latent Dirichlet Allocation for automatic categorization of software , 2009, 2009 6th IEEE International Working Conference on Mining Software Repositories.

[7]  Jarrett Rosenberg,et al.  Some misconceptions about lines of code , 1997, Proceedings Fourth International Software Metrics Symposium.

[8]  Denys Poshyvanyk,et al.  Using Relational Topic Models to capture coupling among classes in object-oriented software systems , 2010, 2010 IEEE International Conference on Software Maintenance.

[9]  A. Raftery Bayesian Model Selection in Social Research , 1995 .

[10]  Ahmed E. Hassan,et al.  Modeling the evolution of topics in source code histories , 2011, MSR '11.

[11]  Santonu Sarkar,et al.  Mining business topics in source code using latent dirichlet allocation , 2008, ISEC '08.

[12]  Stéphane Ducasse,et al.  Semantic clustering: Identifying topics in source code , 2007, Inf. Softw. Technol..

[13]  Leo R Beard,et al.  Statistical Methods in Hydrology , 1962 .

[14]  N. Nagappan,et al.  Use of relative code churn measures to predict system defect density , 2005, Proceedings. 27th International Conference on Software Engineering, 2005. ICSE 2005..

[15]  Daryl Pregibon,et al.  An analysis of static metrics and faults in C software , 1985, J. Syst. Softw..

[16]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[17]  Tu Minh Phuong,et al.  Topic-based defect prediction: NIER track , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[18]  Mayuram S. Krishnan,et al.  Evaluating the cost of software quality , 1998, CACM.

[19]  J. E. Jackson,et al.  Factor analysis, an applied approach , 1983 .

[20]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[21]  Tu Minh Phuong,et al.  Topic-based defect prediction. , 2011, ICSE 2011.

[22]  David R. Anderson,et al.  Multimodel Inference , 2004 .

[23]  Andrew McCallum,et al.  Rethinking LDA: Why Priors Matter , 2009, NIPS.

[24]  Avinash C. Kak,et al.  Retrieval from software libraries for bug localization: a comparative study of generic and composite text models , 2011, MSR '11.

[25]  Scott Grant,et al.  Estimating the Optimal Number of Latent Concepts in Source Code Analysis , 2010, 2010 10th IEEE Working Conference on Source Code Analysis and Manipulation.

[26]  Yann-Gaël Guéhéneuc,et al.  Feature Location Using Probabilistic Ranking of Methods Based on Execution Scenarios and Information Retrieval , 2007, IEEE Transactions on Software Engineering.

[27]  Harald C. Gall,et al.  Don't touch my code!: examining the effects of ownership on software quality , 2011, ESEC/FSE '11.

[28]  Thomas Hofmann,et al.  Probabilistic latent semantic indexing , 1999, SIGIR '99.

[29]  Cristina V. Lopes,et al.  An Application of Latent Dirichlet Allocation to Analyzing Software Evolution , 2008, 2008 Seventh International Conference on Machine Learning and Applications.

[30]  Letha H. Etzkorn,et al.  Bug localization using latent Dirichlet allocation , 2010, Inf. Softw. Technol..

[31]  Ahmed E. Hassan,et al.  Validating the Use of Topic Models for Software Evolution , 2010, 2010 10th IEEE Working Conference on Source Code Analysis and Manipulation.

[32]  Denys Poshyvanyk,et al.  Using structural and textual information to capture feature coupling in object-oriented software , 2011, Empirical Software Engineering.

[33]  James H. Stapleton Models for Probability and Statistical Inference: Theory and Applications , 2007 .

[34]  Michele Lanza,et al.  An extensive comparison of bug prediction approaches , 2010, 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010).

[35]  Tracy Hall,et al.  A Systematic Literature Review on Fault Prediction Performance in Software Engineering , 2012, IEEE Transactions on Software Engineering.

[36]  Michael English,et al.  An empirical analysis of information retrieval based concept location techniques in software comprehension , 2008, Empirical Software Engineering.

[37]  Sushil Krishna Bajracharya,et al.  A theory of aspects as latent topics , 2008, OOPSLA.