Where Does LDA Sit for GitHub?

Much research in Software Engineering (SE) automatically extracts topics from text data and uses the results directly or as features for a machine learning method. Prior work has shown that the majority of SE studies use Latent Dirichlet Allocation (LDA) as their topic modeling approach, and a large body of work applies LDA to GitHub data in particular. However, no study has explored whether LDA is a good choice compared to other algorithms, nor investigated the effects of specific pre-processing steps on its performance. In this paper, we explore a large dataset of GitHub repositories and apply two main topic modeling algorithms, LDA (3 variants) and Non-Negative Matrix Factorization (NMF), in several experiments with different experimental settings. The results show that LDA yields a higher coherence score than NMF. However, care should be taken in choosing the LDA algorithm, setting its parameters, and selecting the text pre-processing steps. The results of this paper benefit SE researchers who apply intelligent techniques using LDA.
