Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability

In this paper, we investigate ideas from Machine Learning, Natural Language Processing, and Information Retrieval to outline possible research directions in software architecture recovery and clone detection. In particular, after an extensive review of related work, we illustrate two proposals for addressing these two issues, both of which are hot topics in Software Maintenance. Both proposals use Kernel Methods to exploit structural representations of source code and to automate the detection of clones and the recovery of the architecture actually implemented in a subject software system.
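To make the idea of a kernel over a structural representation of source code concrete, the following is a minimal sketch and not the kernel proposed in the paper. It assumes Python source parsed with the standard `ast` module, and the function names `subtree_signatures`, `subtree_kernel`, and `similarity` are hypothetical. It counts complete subtrees shared by two abstract syntax trees, ignoring identifiers and literal values, and normalizes the result to a similarity score.

```python
import ast
from collections import Counter
from math import sqrt


def subtree_signatures(source):
    """Count the structural signature of every AST subtree in `source`."""
    counts = Counter()

    def visit(node):
        # Signature = node type plus the signatures of its child nodes.
        # Identifier names and literal values are plain fields, not nodes,
        # so they are ignored (an illustrative simplification).
        children = [visit(child) for child in ast.iter_child_nodes(node)]
        sig = type(node).__name__ + "(" + ",".join(children) + ")"
        counts[sig] += 1
        return sig

    visit(ast.parse(source))
    return counts


def subtree_kernel(a, b):
    """Dot product of subtree-count vectors: a simple kernel over trees."""
    ca, cb = subtree_signatures(a), subtree_signatures(b)
    return sum(ca[s] * cb[s] for s in ca.keys() & cb.keys())


def similarity(a, b):
    """Kernel value normalized to the range [0, 1]."""
    return subtree_kernel(a, b) / sqrt(subtree_kernel(a, a) * subtree_kernel(b, b))


# Hypothetical fragments: `add_a` and `add_b` differ only in identifiers.
add_a = "def f(x, y):\n    return x + y\n"
add_b = "def g(p, q):\n    return p + q\n"
loop_c = "def h(items):\n    for i in items:\n        print(i)\n"

score_clone = similarity(add_a, add_b)
score_other = similarity(add_a, loop_c)
```

Under these assumptions, fragments that differ only in identifier names get the maximum score, while structurally different fragments score lower. A clone detector could flag pairs whose score exceeds a chosen threshold, and the same kernel values could feed a kernel-based clustering step for architecture recovery.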
