Source code analysis with LDA

Latent Dirichlet allocation (LDA) is increasingly used to understand source code and its related artifacts, in part because of its modeling power. This expressive power comes at a cost, however: the technique has several tuning parameters whose impact on the resulting LDA model must be carefully considered. This work aims to provide insight into that impact, which should help both researchers who exploit LDA in their own work and those who interpret the output of LDA‐based tools. The goal is not to establish values for the tuning parameters, because there is no universal best setting. Rather, appropriate settings depend on the problem being solved, the input corpus (here, typically words from the source code and its supporting artifacts), and the needs of the engineer performing the analysis. The primary goal is to help software engineers understand the LDA tuning parameters by demonstrating, numerically and graphically, the relationship between the tuning parameters and the LDA output. A secondary goal is to enable more informed setting of the parameters. Copyright © 2016 John Wiley & Sons, Ltd.
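As a small numeric illustration of how a tuning parameter can shape LDA output, the sketch below draws document-topic proportions from a symmetric Dirichlet prior at several concentration values. The abstract does not name the parameters; the sketch assumes the standard LDA hyperparameter usually written alpha, and the function names, the topic count of 10, and the sample size are hypothetical choices made only for illustration. It samples from the prior alone and does not fit a model to any corpus.

```python
import random


def sample_symmetric_dirichlet(alpha, k, rng):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha).

    Uses the standard construction: normalize k independent
    Gamma(alpha, 1) draws so that they sum to 1.
    """
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0.0:  # guard against floating-point underflow for tiny alpha
            return [d / total for d in draws]


def mean_largest_share(alpha, k=10, n_docs=2000, seed=0):
    """Average, over simulated documents, of the largest topic proportion.

    Values near 1 mean a document is dominated by a single topic;
    values near 1/k mean its topics are close to evenly mixed.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_docs):
        total += max(sample_symmetric_dirichlet(alpha, k, rng))
    return total / n_docs


if __name__ == "__main__":
    # Illustrative concentration values only, not recommended settings.
    for alpha in (0.1, 1.0, 10.0):
        share = mean_largest_share(alpha)
        print(f"alpha={alpha:<5} mean largest topic share = {share:.2f}")
```

Under these assumptions, smaller alpha values concentrate each simulated document on fewer topics, while larger values spread the proportions more evenly. That is one reason no single setting suits every corpus or task.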
