Measuring LDA topic stability from clusters of replicated runs

Background: Unstructured and textual data are increasing rapidly, and Latent Dirichlet Allocation (LDA) topic modeling is a popular method for analyzing them. Past work suggests that instability of LDA topics may lead to systematic errors. Aim: We propose a method that combines replicated LDA runs, clustering, and a stability metric for the topics. Method: We generate k LDA topics and replicate this process n times, resulting in n*k topics. We then use K-medoids to cluster the n*k topics into k clusters. The k clusters now represent the original LDA topics, and we present them like normal LDA topics by showing the ten most probable words. For the clusters, we try multiple stability metrics, of which we recommend Rank-Biased Overlap, which shows the stability of the topics inside each cluster. Results: We provide an initial validation in which our method is applied to 270,000 Mozilla Firefox commit messages with k=20 and n=20. We show how our topic stability metrics are related to the contents of the topics. Conclusions: Advances in text mining enable us to analyze large masses of text in software engineering, but non-deterministic algorithms, such as LDA, may lead to unreplicable conclusions. Our approach makes LDA stability transparent and complements, rather than replaces, many prior works that focus on LDA parameter tuning.
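The clustering and stability steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the LDA runs are replaced by hand-written toy top-word lists, the function names (`rbo_ext`, `k_medoids`, `cluster_stability`) are hypothetical, the distance 1 - RBO between top-word lists is an assumed choice for the K-medoids step, and averaging pairwise Rank-Biased Overlap inside a cluster is one assumed way of turning RBO into a per-cluster stability score.

```python
from itertools import combinations

def rbo_ext(s, t, p=0.9):
    """Extrapolated Rank-Biased Overlap of two ranked lists without duplicates."""
    depth = min(len(s), len(t))
    seen_s, seen_t = set(), set()
    weighted_sum, agreement = 0.0, 0.0
    for d in range(1, depth + 1):
        seen_s.add(s[d - 1])
        seen_t.add(t[d - 1])
        agreement = len(seen_s & seen_t) / d  # overlap of the two prefixes at depth d
        weighted_sum += agreement * p ** d
    return agreement * p ** depth + (1 - p) / p * weighted_sum

def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(dist, k, max_iter=100):
    """Simple alternating k-medoids on a precomputed distance matrix."""
    medoids = list(range(k))  # assumption: naive deterministic start, not a tuned init
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster becomes empty
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)

def cluster_stability(topics, labels, cluster, p=0.9):
    """Mean pairwise RBO among the topics assigned to one cluster."""
    members = [topics[i] for i, lab in enumerate(labels) if lab == cluster]
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0  # assumption: a single-topic cluster is treated as fully stable
    return sum(rbo_ext(a, b, p) for a, b in pairs) / len(pairs)

# Toy stand-in for n=3 replicated runs with k=2 topics each (n*k = 6 topics).
topics = [
    ["bug", "fix", "crash", "test"],
    ["build", "make", "config", "file"],
    ["bug", "crash", "fix", "test"],
    ["build", "config", "make", "file"],
    ["fix", "bug", "crash", "leak"],
    ["make", "build", "file", "config"],
]
k = 2
dist = [[1.0 - rbo_ext(a, b) for b in topics] for a in topics]
medoids, labels = k_medoids(dist, k)
stability = [cluster_stability(topics, labels, c) for c in range(k)]
print(labels, stability)
```

In a real analysis, the `topics` list would hold the ten most probable words of each of the n*k topics produced by the replicated LDA runs, and a low stability value for a cluster would flag a topic whose content varies between runs.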
