Modeling the evolution of development topics using Dynamic Topic Models

As the development of a software project progresses, its complexity grows accordingly, making it difficult to understand and maintain. During software maintenance and evolution, software developers and stakeholders constantly shift their focus between different tasks and topics. They need to investigate into software repositories (e.g., revision control systems) to know what tasks have recently been worked on and how much effort has been devoted to them. For example, if an important new feature request is received, an amount of work that developers perform on ought to be relevant to the addition of the incoming feature. If this does not happen, project managers might wonder what kind of work developers are currently working on. Several topic analysis tools based on Latent Dirichlet Allocation (LDA) have been proposed to analyze information stored in software repositories to model software evolution, thus helping software stakeholders to be aware of the focus of development efforts at various time during software evolution. Previous LDA-based topic analysis tools can capture either changes on the strengths of various development topics over time (i.e., strength evolution) or changes in the content of existing topics over time (i.e., content evolution). Unfortunately, none of the existing techniques can capture both strength and content evolution. In this paper, we use Dynamic Topic Models (DTM) to analyze commit messages within a project's lifetime to capture both strength and content evolution simultaneously. We evaluate our approach by conducting a case study on commit messages of two well-known open source software systems, jEdit and PostgreSQL. The results show that our approach could capture not only how the strengths of various development topics change over time, but also how the content of each topic (i.e., words that form the topic) changes over time. Compared with existing topic analysis approaches, our approach can provide a more complete and valuable view of software evolution to help developers better understand the evolution of their projects.

[1]  Robert L. Grossman,et al.  Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining , 2005, KDD 2005.

[2]  Sushil Krishna Bajracharya,et al.  A theory of aspects as latent topics , 2008, OOPSLA.

[3]  F BaldiPierre,et al.  A theory of aspects as latent topics , 2008 .

[4]  Stéphane Ducasse,et al.  Semantic clustering: Identifying topics in source code , 2007, Inf. Softw. Technol..

[5]  Letha H. Etzkorn,et al.  Configuring latent Dirichlet allocation based feature location , 2014, Empirical Software Engineering.

[6]  Daniel Jurafsky,et al.  Studying the History of Ideas Using Topic Models , 2008, EMNLP.

[7]  Michael Pilato Version Control with Subversion , 2004 .

[8]  Brian Berliner,et al.  CVS II: Parallelizing Software Dev elopment , 1998 .

[9]  David Lo,et al.  Version history, similar report, and structure: putting them together for improved bug localization , 2014, ICPC 2014.

[10]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[11]  Daniel M. Germán,et al.  The promises and perils of mining git , 2009, 2009 6th IEEE International Working Conference on Mining Software Repositories.

[12]  John D. Lafferty,et al.  Dynamic topic models , 2006, ICML.

[13]  E. Burton Swanson,et al.  The dimensions of maintenance , 1976, ICSE '76.

[14]  A.E. Hassan,et al.  The road ahead for Mining Software Repositories , 2008, 2008 Frontiers of Software Maintenance.

[15]  Letha H. Etzkorn,et al.  Bug localization using latent Dirichlet allocation , 2010, Inf. Softw. Technol..

[16]  Ahmed E. Hassan,et al.  Validating the Use of Topic Models for Software Evolution , 2010, 2010 10th IEEE Working Conference on Source Code Analysis and Manipulation.

[17]  Michael W. Godfrey,et al.  What's hot and what's not: Windowed developer topic analysis , 2009, 2009 IEEE International Conference on Software Maintenance.

[18]  Santonu Sarkar,et al.  Mining business topics in source code using latent dirichlet allocation , 2008, ISEC '08.

[19]  Daniel M. Germán,et al.  What do large commits tell us?: a taxonomical study of large commits , 2008, MSR '08.

[20]  Cristina V. Lopes,et al.  An Application of Latent Dirichlet Allocation to Analyzing Software Evolution , 2008, 2008 Seventh International Conference on Machine Learning and Applications.

[21]  Ahmed E. Hassan,et al.  Mining Unstructured Software Repositories , 2014, Evolving Software Systems.

[22]  Richard N. Taylor,et al.  Software traceability with topic modeling , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[23]  Bin Li,et al.  What Information in Software Historical Repositories Do We Need to Support Software Maintenance Tasks? An Approach Based on Topic Model , 2015, Computer and Information Science.

[24]  Daniela E. Damian,et al.  The promises and perils of mining GitHub , 2009, MSR 2014.

[25]  Ahmed E. Hassan,et al.  Studying software evolution using topic models , 2014, Sci. Comput. Program..

[26]  ChengXiang Zhai,et al.  Discovering evolutionary theme patterns from text: an exploration of temporal text mining , 2005, KDD '05.

[27]  Andrea De Lucia,et al.  How to effectively use topic models for software engineering tasks? An approach based on Genetic Algorithms , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[28]  Ahmed E. Hassan,et al.  What Can OSS Mailing Lists Tell Us? A Preliminary Psychometric Text Analysis of the Apache Developer Mailing List , 2007, Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007).

[29]  Jonathan I. Maletic,et al.  Journal of Software Maintenance and Evolution: Research and Practice Survey a Survey and Taxonomy of Approaches for Mining Software Repositories in the Context of Software Evolution , 2022 .

[30]  Junwu Zhu,et al.  Empirical studies on the NLP techniques for source code data preprocessing , 2014, EAST 2014.