Automatic categorization algorithm for evolvable software archive

The number of software systems is increasing rapidly. For example, SourceForge currently has about sixty thousand registered software systems, twenty-two thousand of which were added in the past twelve months. For software evolution, it is important to be able to search a software archive for existing similar systems and to reuse them. The evolution history of an existing similar system is useful, and we may even evolve a software system from an existing one instead of creating it from scratch. We propose an automatic software categorization algorithm to help find similar software systems in a software archive. At present, we leave open the question of the nature of the categorization and explore several known approaches, including a code-clone-based similarity metric, decision trees, and latent semantic analysis. The results of applying each approach give us some insight into the problem space and set directions for further work.
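
To make the idea of a similarity-based categorization concrete, the following is a minimal sketch and not the algorithm evaluated in this work. It assumes a deliberately simplified stand-in for clone detection: two systems are compared by the Jaccard overlap of their whitespace-separated token trigrams, and systems whose similarity reaches a threshold are placed in the same category. The function names, the trigram size, and the threshold value are all hypothetical choices made for illustration.

```python
from itertools import combinations


def token_ngrams(source, n=3):
    """Return the set of token n-grams in a source text (whitespace tokens)."""
    tokens = source.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard overlap of token n-grams; a crude proxy for shared code clones."""
    ga, gb = token_ngrams(a, n), token_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def categorize(systems, threshold=0.3):
    """Group systems linked by pairwise similarity >= threshold (assumed value).

    `systems` maps a system name to its concatenated source text.
    Groups are the connected components of the similarity graph.
    """
    parent = {name: name for name in systems}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(systems, 2):
        if similarity(systems[a], systems[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in systems:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

A real archive would need a proper lexer and a scalable clone detector rather than all-pairs comparison; the sketch only shows how a pairwise similarity metric can induce categories.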
