Comparative study of clustering algorithms and abstract representations for software remodularisation

As valuable software systems age, reverse engineering becomes increasingly important to the companies that must maintain the code. Clustering is a key activity in reverse engineering, used to discover improved designs of systems or to extract significant concepts from code. It is an old and highly sophisticated activity that offers many methods to meet different needs. These methods have been well documented; however, conclusions from the general clustering literature may not apply entirely to the reverse engineering domain. In this paper, the authors study three decisions that must be made when clustering: the choice of (i) abstract descriptions of the entities to be clustered, (ii) metrics to compute coupling between the entities, and (iii) clustering algorithms. For each decision, the objective is to understand which choices are best when performing software remodularisation. The experiments were conducted on three public domain systems (gcc, Linux and Mosaic) and a real-world legacy system (2 million LOC). Among other things, the authors confirm the importance of a proper description scheme for the entities being clustered, list a few effective coupling metrics and characterise the quality of different clustering algorithms. They also propose description schemes not directly based on the source code, and advocate better formal evaluation methods for the clustering results.
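To make the three decisions concrete, the following minimal sketch shows how they might fit together in one small pipeline. Everything specific in it is an illustrative assumption rather than a choice reported in the paper: the file names and feature sets are hypothetical, the Jaccard coefficient stands in for a coupling metric, and complete-linkage agglomerative clustering with a 0.4 cut-off stands in for the clustering algorithm.

```python
from itertools import combinations

# Decision (i): abstract description of each entity.
# Hypothetical files, each described by the set of identifiers it references.
features = {
    "parse.c": {"token", "ast_node", "lex_next"},
    "lexer.c": {"token", "lex_next", "buffer"},
    "emit.c": {"ast_node", "reg_alloc", "opcode"},
    "regs.c": {"reg_alloc", "opcode"},
}


# Decision (ii): coupling metric between two entities.
def jaccard(a, b):
    """Jaccard coefficient: shared features divided by all features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Decision (iii): clustering algorithm (assumed: complete linkage).
def complete_linkage(c1, c2, feats):
    """Cluster similarity is that of the least similar pair of members."""
    return min(jaccard(feats[x], feats[y]) for x in c1 for y in c2)


def agglomerate(feats, threshold):
    """Merge the most similar clusters until none reaches the threshold."""
    clusters = [frozenset([name]) for name in sorted(feats)]
    while len(clusters) > 1:
        # Score every pair of clusters and pick the most similar one.
        (a, b), best = max(
            (((p, q), complete_linkage(p, q, feats))
             for p, q in combinations(clusters, 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


clusters = agglomerate(features, threshold=0.4)
for cluster in clusters:
    print(sorted(cluster))
```

Swapping the contents of `features`, the body of `jaccard`, or the linkage rule changes one decision at a time, which is the kind of comparison the study describes.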
