Clustering Methodologies for Software Engineering

The size and complexity of industrial-strength software systems are constantly increasing. As a result, managing a large software project is becoming ever more challenging, especially in light of the high turnover of experienced personnel. Software clustering approaches can help with understanding large, complex software systems by automatically decomposing them into smaller, easier-to-manage subsystems. The main objective of this paper is to identify important research directions in software clustering that require further attention in order to develop more effective and efficient clustering methodologies for software engineering. To that end, we first present the state of the art in software clustering research. We discuss the clustering methods that have received the most attention from the research community and outline their strengths and weaknesses. We describe each phase of a clustering algorithm separately. We also present the most important approaches for evaluating the effectiveness of software clustering.
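As a rough illustration of what "automatically decomposing" a system can mean, the sketch below groups source files whose dependency sets overlap, using a greedy single-linkage agglomeration over Jaccard similarity. This is only one simple possibility and is not claimed to be a method from the paper; the file names, the `deps` mapping, the `cluster` function, and the 0.5 threshold are all hypothetical choices made for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(features, threshold):
    """Greedy single-linkage agglomeration: repeatedly merge the two most
    similar clusters until no pair reaches the similarity threshold."""
    clusters = [frozenset([name]) for name in sorted(features)]

    def linkage(c1, c2):
        # Single linkage: similarity of the closest pair of members.
        return max(jaccard(features[x], features[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        if linkage(*best) < threshold:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy system: each file maps to the entities it depends on.
deps = {
    "parser.c": {"token.h", "ast.h"},
    "lexer.c": {"token.h", "ast.h", "io.h"},
    "socket.c": {"net.h", "io.h"},
    "http.c": {"net.h", "io.h", "url.h"},
}

subsystems = cluster(deps, threshold=0.5)
print(subsystems)
```

On this toy input the parsing files and the networking files end up in separate groups. The threshold acts as the stopping criterion: raising it leaves more files unclustered, while lowering it merges everything into a single subsystem.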
