Automated classification of software change messages by semi-supervised Latent Dirichlet Allocation

Abstract

Context: Topic models such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) have been applied successfully to mining software repositories. Understanding software change messages, which are written as unstructured natural-language text, is one of the fundamental challenges in mining these messages from repositories.

Objective: We present a novel automatic change message classification method based on semi-supervised topic semantic analysis.

Method: We present a semi-supervised LDA-based approach to automatically classify change messages. We use domain knowledge of software changes to create labeled samples, which are added when building the semi-supervised LDA model. We then verify that the method applies to cross-project analysis on three open-source projects. Our method has two advantages over existing software change classification methods. First, it mitigates the problem of choosing an appropriate number of latent topics: the number of topics is not a free parameter, because it equals the number of class labels. Second, it uses the information provided by the labeled samples in the training set.

Results: Our method automatically classified about 85% of the change messages in our experiment, and our validation survey showed that the automatic classification results agreed with developer opinions 70.56% of the time.

Conclusion: Our approach automatically classifies most change messages that record the cause of a software change, and the method is applicable to cross-project analysis of software change messages.
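The idea in the Method paragraph (one topic per class label, with labeled samples anchoring the topics) can be illustrated with a small sketch. This is not the authors' implementation: it assumes a collapsed Gibbs sampler in which tokens of labeled messages are clamped to their label's topic, and the function name, hyperparameters, class names, and toy commit messages are all hypothetical.

```python
import random


def semi_supervised_lda(docs, labels, classes, iters=200,
                        alpha=0.1, beta=0.01, seed=0):
    """Toy semi-supervised LDA (illustrative assumption, not the paper's code).

    docs    -- list of token lists (one per change message)
    labels  -- class name for labeled messages, None for unlabeled ones
    classes -- list of class names; the number of topics K is len(classes)
    Returns the predicted class name for every message.
    """
    rng = random.Random(seed)
    K = len(classes)                       # topics correspond to class labels
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}

    n_dk = [[0] * K for _ in docs]         # per-document topic counts
    n_kw = [[0] * V for _ in range(K)]     # per-topic word counts
    n_k = [0] * K                          # total tokens per topic
    z = []                                 # topic assignment of every token

    # Initialise: labeled tokens take their label's topic, others are random.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            if labels[d] is not None:
                k = classes.index(labels[d])
            else:
                k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][wid[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    # Collapsed Gibbs sweeps over the unlabeled messages only.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            if labels[d] is not None:
                continue                   # labeled messages stay clamped
            for i, w in enumerate(doc):
                k, v = z[d][i], wid[w]
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta)
                    / (n_k[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    # Classify each message by its dominant topic.
    return [classes[max(range(K), key=lambda t: n_dk[d][t])]
            for d in range(len(docs))]


# Hypothetical toy corpus: four labeled messages and two unlabeled ones.
classes = ["bug_fix", "feature"]
docs = [
    ["fix", "bug", "crash"],
    ["fix", "null", "pointer", "bug"],
    ["add", "feature", "support"],
    ["add", "new", "option", "support"],
    ["fix", "crash", "pointer"],
    ["add", "support", "option"],
]
labels = ["bug_fix", "bug_fix", "feature", "feature", None, None]
predicted = semi_supervised_lda(docs, labels, classes)
print(predicted)
```

Because the number of topics is taken from `classes`, there is no separate topic-count parameter to tune, which mirrors the first advantage claimed in the abstract. The clamping of labeled tokens is only one simple way to inject supervision; the paper may use a different mechanism.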
