Where Does LDA Sit for GitHub?

Much research in Software Engineering (SE) automatically extracts topics from text data and uses the results directly or as features for a machine learning method. Prior work has shown that the majority of SE studies use Latent Dirichlet Allocation (LDA) as their topic modeling approach, and a large body of work applies LDA to GitHub data in particular. However, no study has explored whether LDA is a good choice compared to other algorithms, nor investigated the effects of specific pre-processing steps on its performance. In this paper, we explore a large dataset of GitHub repositories and apply two main topic modeling algorithms, LDA (3 variants) and Non-Negative Matrix Factorization (NMF), in several experiments with different experimental settings. The results show that LDA yields a higher coherence score than NMF. However, care should be taken in choosing the LDA algorithm, setting its parameters, and selecting the text pre-processing steps. The results of this paper benefit SE researchers who apply intelligent techniques using LDA.
