Architecture Recovery Using Latent Semantic Indexing and K-Means: An Empirical Evaluation

A number of clustering-based approaches and tools have been proposed to partition a software system into subsystems. Most of these approaches are semiautomatic: they require a human to choose the best partition of software entities into clusters from among the possible partitions. In addition, some approaches are designed for software systems implemented in a particular programming language (e.g., C or C++). In this paper we present an approach that automates the partitioning of a given software system into subsystems. The approach first analyzes the software entities (e.g., programs or classes) and then uses Latent Semantic Indexing to compute the dissimilarity between them. The software entities are then grouped by iteratively applying the k-means clustering algorithm. The approach has been implemented in a prototype supporting tool, delivered as an Eclipse plug-in. Finally, to assess the approach and the plug-in, we conducted an empirical investigation on three open source software systems implemented in Java and C/C++.
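The pipeline summarized above (entity text, Latent Semantic Indexing, k-means) can be illustrated with a minimal Python sketch. It is not the plug-in's implementation. The function names, the toy entities, the raw term counts, the use of Euclidean distance in the reduced space as the dissimilarity, and the farthest-first initialization are all assumptions made for illustration. The abstract does not say how k-means is applied iteratively, so the sketch runs a single k-means pass.

```python
import numpy as np

def term_document_matrix(docs):
    """Build a term-by-entity matrix of raw term counts (assumed weighting)."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    row = {term: i for i, term in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term in doc.split():
            A[row[term], j] += 1.0
    return A

def lsi_embed(A, rank):
    """Project each entity (column of A) into a rank-dimensional LSI space."""
    _, s, vt = np.linalg.svd(A, full_matrices=False)
    # One row per entity, scaled by the retained singular values.
    return vt[:rank].T * s[:rank]

def kmeans(X, k, max_iter=100):
    """Plain k-means with deterministic farthest-first initialization."""
    centers = [X[0]]
    while len(centers) < k:
        # Distance of every point to its nearest already-chosen center.
        nearest = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(nearest))])
    centers = np.array(centers)
    for _ in range(max_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center; an empty cluster keeps its previous center.
        updated = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels.tolist()

def cluster_entities(docs, rank, k):
    """Hypothetical end-to-end helper: text -> LSI space -> cluster labels."""
    return kmeans(lsi_embed(term_document_matrix(docs), rank), k)

# Toy entities: identifiers and comments already split into terms (assumed input).
ENTITIES = [
    "parse token lexer parse",
    "lexer token grammar",
    "socket packet send",
    "packet socket receive send",
]

if __name__ == "__main__":
    print(cluster_entities(ENTITIES, rank=2, k=2))
```

On this toy input the two parsing entities share no terms with the two networking entities, so they end up in separate clusters. In a real system the choice of `rank` and `k` would need to be determined from the data rather than fixed by hand.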
