Recovering software architecture from the names of source files

We discuss how to extract a useful set of subsystems from a set of existing source-code file names. This problem is challenging because many legacy systems use thousands of file names, including some that are very short and cryptic. At the same time, the problem is important because software maintainers often find it difficult to understand such systems. We propose a general algorithm to cluster files based on their names, and a set of alternative methods for implementing the algorithm. One of the key tasks is picking candidate words to look for in file names. We do this by (a) iteratively decomposing file names, (b) finding common substrings, and (c) choosing words from routine names, an English dictionary, or source-code comments. In addition, we investigate generating abbreviations from the candidate words in order to find matches in file names, as well as how to split file names into components when no word markers are present. To compare and evaluate our five approaches, we present two experiments. The first compares the ‘concepts’ found in each file name by each method with the results of manually decomposing file names. The second experiment compares automatically generated subsystems with subsystem examples proposed by experts. We conclude that two methods are most effective: extracting concepts using common substrings and extracting those concepts that relate to the names of routines in the files. Copyright © 1999 John Wiley & Sons, Ltd.
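As a rough illustration of the common-substring idea mentioned in the abstract, the following Python sketch treats every substring shared by several file names as a candidate concept and groups files under it. It is not the authors' algorithm: the function names, the minimum substring length of 3, the minimum of 2 files per concept, and the sample file names are all assumptions made for this example.

```python
from collections import defaultdict


def substrings(name, min_len=3):
    """Return every substring of `name` with at least `min_len` characters."""
    return {
        name[i:j]
        for i in range(len(name))
        for j in range(i + min_len, len(name) + 1)
    }


def common_substring_concepts(file_names, min_len=3, min_files=2):
    """Group files under substrings shared by at least `min_files` names.

    Hypothetical helper: thresholds are illustrative, not from the paper.
    """
    index = defaultdict(set)
    for fname in file_names:
        # Drop the extension and ignore case before extracting substrings.
        stem = fname.rsplit(".", 1)[0].lower()
        for sub in substrings(stem, min_len):
            index[sub].add(fname)

    shared = {s: fs for s, fs in index.items() if len(fs) >= min_files}

    # Keep only maximal substrings: discard one when a longer shared
    # substring contains it and covers exactly the same files.
    maximal = {}
    for s, fs in shared.items():
        subsumed = any(
            s != t and s in t and fs == ft for t, ft in shared.items()
        )
        if not subsumed:
            maximal[s] = fs
    return maximal


if __name__ == "__main__":
    # Made-up legacy-style file names, for illustration only.
    names = ["listmgr.c", "listutil.c", "dbmopen.c", "dbmclose.c"]
    for concept, files in sorted(common_substring_concepts(names).items()):
        print(concept, sorted(files))
```

On the made-up names above, the sketch yields two candidate concepts, `list` and `dbm`, each covering two files. A practical version would also need the abbreviation matching and name-splitting steps that the abstract describes, which this sketch omits.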
