Latent Dirichlet Allocation

Abstract: Topic analysis is a powerful tool for extracting “topics” from document collections. Unlike manual tagging, which is effort-intensive and requires expertise in the documents’ subject matter, topic analysis (in its simplest form) is an automated process. Relying on the assumption that each document in a collection refers to a small number of topics, it extracts bags of words attributable to these topics. These topics can be used to support document retrieval or to relate documents to each other through their associated topics. Given the variety and amount of textual information found in software repositories, issue reports, commit and source-code comments, and other forms of documentation, this method has found many applications in the software-engineering field of mining software repositories. This chapter first provides an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic-analysis method today. Second, it illustrates, with a brief tutorial introduction, how to employ LDA on a textual data set. Third, it reviews the software-engineering literature for uses of LDA in analyzing textual software-development assets to support developers’ activities. Finally, it discusses the interpretability of the automatically extracted topics and their correlation with tags provided by subject-matter experts.
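
The chapter's own tutorial is not reproduced here. As a rough illustration of the idea summarized above (each document mixes a small number of topics, and each topic is a bag of words), the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling, one common inference scheme; the chapter may rely on a different method or tool. The toy documents, the function names `lda_gibbs` and `top_words`, and the hyperparameter values `alpha` and `beta` are all assumptions made for this example.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA on tokenized docs with collapsed Gibbs sampling (illustrative only)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample: (topic share in this doc) x (word share in this topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

def top_words(topic_word, topic, n=3):
    """Return up to n most frequent words currently assigned to a topic."""
    return [w for w, c in topic_word[topic].most_common(n) if c > 0]

# Hypothetical, pre-tokenized issue-report snippets (not from the chapter).
docs = [
    ["crash", "null", "pointer", "exception", "login", "crash"],
    ["login", "password", "exception", "crash", "null"],
    ["button", "layout", "color", "font", "layout"],
    ["font", "color", "button", "theme", "layout"],
]

doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k in range(2):
    print("topic", k, top_words(topic_word, k))
for d, counts in enumerate(doc_topic):
    print("doc", d, "tokens per topic:", counts)
```

The per-document counts in `doc_topic` can be normalized into topic proportions and used to relate documents to one another, while `top_words` gives the bag of words that characterizes each topic. Results depend on the random seed and on the number of topics chosen, so the output of a run on such a tiny corpus should not be read as a stable result.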
