Using heuristics to estimate an appropriate number of latent topics in source code analysis

Abstract: Latent Dirichlet Allocation (LDA) is a data clustering algorithm that performs especially well for text documents. In natural-language applications it automatically finds groups of related words (called “latent topics”) and clusters the documents into sets that are about the same “topic”. LDA has also been applied to source code, where the documents are natural source code units such as methods or classes, and the words are the keywords, operators, and programmer-defined names in the code. Determining the topic count that most appropriately describes a set of source code documents remains an open problem. We address it empirically by constructing clusterings with different numbers of topics for a large number of software systems, and then using a pair of measures based on source code locality and topic model similarity to assess how well the topic structure identifies related source code units. Results suggest that the required topic count can be closely approximated from the number of code fragments in the system. We extend these results to recommend appropriate topic counts for arbitrary software systems, based on an analysis of a set of open source systems.
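
To make the document and word mapping concrete, the following is a minimal sketch of how source code units might be turned into the bag-of-words input that a topic model such as LDA consumes. It is an illustration under stated assumptions, not the tokenization used in the study: the token pattern, the function names `tokenize` and `build_term_counts`, and the two sample fragments are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical token pattern (an assumption, not taken from the paper):
# identifiers and keywords first, then integer literals, then operator runs.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[-+*/%=<>!&|^~]+")


def tokenize(fragment):
    """Split one source code unit into lowercase word tokens."""
    return [tok.lower() for tok in TOKEN_RE.findall(fragment)]


def build_term_counts(fragments):
    """Map each fragment name to a bag-of-words Counter of its tokens."""
    return {name: Counter(tokenize(body)) for name, body in fragments.items()}


# Each "document" is one natural source code unit, here a function.
fragments = {
    "add": "int add(int a, int b) { return a + b; }",
    "is_less": "bool is_less(int a, int b) { return a < b; }",
}

counts = build_term_counts(fragments)

# The shared vocabulary spans keywords, operators, and programmer-defined names.
vocabulary = sorted(set().union(*counts.values()))
```

The resulting per-fragment counts over a shared vocabulary form a document-term matrix, which could then be passed to any LDA implementation and refit for each candidate topic count.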
