Using Latent Dirichlet Allocation for automatic categorization of software

In this paper, we propose a technique called LACT for automatically categorizing software systems in open-source repositories. LACT is based on Latent Dirichlet Allocation (LDA), an information retrieval method that we use to index and analyze source code documents as mixtures of probabilistic topics. For an initial evaluation, we performed two studies. In the first study, we compared LACT against an existing tool, MUDABlue, on the task of classifying 41 software systems written in C into problem domain categories. The results indicate that LACT can automatically produce meaningful category names and yields classification results comparable to those of MUDABlue. In the second study, we applied LACT to 43 software systems written in different programming languages, including C/C++, Java, C#, PHP, and Perl. The results indicate that LACT can be used effectively for the automatic categorization of software systems regardless of the underlying programming language or paradigm. Moreover, both studies indicate that LACT can identify several new categories based on libraries, architectures, or programming languages, which is a promising improvement over manual categorization and existing techniques.
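To make the idea of treating source code as mixtures of probabilistic topics concrete, the following is a minimal sketch and not the LACT implementation. It assumes that each software system is reduced to a bag of identifier words and that LDA is fitted with collapsed Gibbs sampling, an inference method chosen here only for brevity. The function names `tokenize` and `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy systems are all hypothetical.

```python
import random
import re


def tokenize(source):
    """Split source text into lowercase identifier words.

    Identifiers are broken at underscores and camelCase boundaries;
    fragments of two characters or fewer are dropped.
    """
    words = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        for part in re.split(r"_+|(?<=[a-z0-9])(?=[A-Z])", ident):
            if len(part) > 2:
                words.append(part.lower())
    return words


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, unoptimized).

    docs is a list of token lists. Returns (theta, topic_word, vocab),
    where theta[d] is the estimated topic mixture of document d.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][index[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    topics = range(num_topics)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = index[w]
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wi] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wi] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in topics
                ]
                z = rng.choices(topics, weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wi] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word, vocab


if __name__ == "__main__":
    # Hypothetical toy "systems"; in practice each entry would hold the
    # concatenated source code of one project.
    systems = {
        "net_tool": "int openSocket(int port); void send_packet(Socket socket, Packet packet);",
        "img_tool": "Image loadImage(String path); void blur_pixels(Image image, int radius);",
    }
    docs = [tokenize(src) for src in systems.values()]
    theta, _, _ = lda_gibbs(docs, num_topics=2)
    for name, mixture in zip(systems, theta):
        print(name, [round(p, 2) for p in mixture])
```

Each row of `theta` is one system's topic mixture. One simple way to turn such mixtures into categories, again an assumption rather than the procedure used by LACT, would be to group systems whose mixtures are similar and to name each group after the highest-weighted words of its dominant topic.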
