What is wrong with topic modeling? And how to fix it using search-based software engineering

Abstract

Context: Topic modeling finds human-readable structures in unstructured textual data. A widely used topic modeling technique is Latent Dirichlet Allocation (LDA). When run on different datasets, LDA suffers from “order effects”, i.e., different topics are generated if the order of the training data is shuffled. Such order effects introduce a systematic error into any study. This error can lead to misleading results; specifically, inaccurate topic descriptions and a reduction in the efficacy of text mining classification results.

Objective: To provide a method in which the distributions generated by LDA are more stable and can be used for further analysis.

Method: We use LDADE, a search-based software engineering tool that uses Differential Evolution (DE) to tune LDA’s parameters. LDADE is evaluated on data from a programmer information exchange site (Stack Overflow), the title and abstract text of thousands of Software Engineering (SE) papers, and software defect reports from NASA. Results were collected across different implementations of LDA (Python+Scikit-Learn, Scala+Spark) on the Linux platform and for different kinds of LDA (VEM, Gibbs sampling). Results were scored via topic stability and text mining classification accuracy.

Results: In all treatments: (i) standard LDA exhibits very large topic instability; (ii) LDADE’s tunings dramatically reduce cluster instability; (iii) LDADE also leads to improved performance for supervised as well as unsupervised learning.

Conclusion: Due to topic instability, using standard LDA with its “off-the-shelf” settings should now be deprecated. Also, in future, we should require SE papers that use LDA to test and (if needed) mitigate LDA topic instability. Finally, LDADE is a candidate technology for effectively and efficiently reducing that instability.
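To make the notion of topic stability concrete, the following is a minimal sketch of one plausible way to score it. The abstract does not define the stability score, so the Jaccard-based measure, the function names (`jaccard`, `best_match_score`, `stability`), and the toy term lists are all assumptions made for illustration and are not taken from LDADE. Each "run" stands for the top terms per topic produced by one LDA run over a differently shuffled copy of the training data.

```python
from itertools import combinations
from statistics import median


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match_score(topic, other_run):
    """Similarity of `topic` to its closest counterpart in another run."""
    return max(jaccard(topic, other) for other in other_run)


def stability(runs):
    """Median best-match similarity over all pairs of runs, in both directions.

    `runs` is a list of runs; each run is a list of topics; each topic is
    a list of its top terms. A value of 1.0 means every topic reappears
    unchanged in every other run; values near 0 mean the topics change
    from run to run. (Assumed measure, for illustration only.)
    """
    scores = []
    for run_a, run_b in combinations(runs, 2):
        scores.extend(best_match_score(t, run_b) for t in run_a)
        scores.extend(best_match_score(t, run_a) for t in run_b)
    return median(scores)


# Hypothetical top-term lists: same topics recovered, only reordered.
stable_runs = [
    [["bug", "crash", "error"], ["test", "unit", "mock"]],
    [["test", "mock", "unit"], ["error", "bug", "crash"]],
]

# Hypothetical top-term lists: shuffling the input changed the topics.
unstable_runs = [
    [["bug", "crash", "error"], ["test", "unit", "mock"]],
    [["bug", "test", "build"], ["crash", "mock", "deploy"]],
]

print(stability(stable_runs))    # 1.0
print(stability(unstable_runs))  # 0.2
```

A tuner such as Differential Evolution could, under these assumptions, treat a score like `stability` as the objective to maximize while varying LDA's parameters; matching topics by best overlap rather than by index avoids penalizing runs that merely list the same topics in a different order.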
