A Framework for Studying Clones In Large Software Systems

Clones are code segments that have been created by copying and pasting from other code segments. Clones occur often in large software systems: reported estimates are that 5% to 50% of the source code of a large software system is cloned. A major challenge when studying code cloning in large software systems is handling the large number of clone candidates produced by leading-edge clone detection tools. For example, the CCFinder clone detection tool produces over 7 million pairs of clone candidates for the Linux kernel (which consists of over 4 MLOC). Moreover, the output of clone detection tools grows rapidly as a software system evolves. Researchers and developers need tools that help them study this volume of clone data in order to better understand the clone phenomena in large systems. In this paper, we propose a data mining framework to help researchers cope with the large amount of data produced by clone detection tools. We propose techniques to reduce, abstract, and highlight the most interesting data produced by clone detection tools. Our framework also introduces a visualization tool that allows users to query and explore clone data at various abstraction levels. We demonstrate our framework in a case study of the clone phenomena in the Linux kernel.
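
As a rough illustration of what abstracting clone data to a coarser level could look like, the sketch below lifts file-level clone pairs to directory-level pairs and counts them. It is not the framework described in the paper: the function names `lift` and `abstract_pairs`, the `depth` parameter, and the sample file paths are all hypothetical, and the input is assumed to be a plain list of file-path pairs rather than the actual output format of any clone detection tool.

```python
from collections import Counter
from pathlib import PurePosixPath


def lift(path, depth):
    """Return the first `depth` directory components of a file path.

    Hypothetical helper: files with no directory map to ".".
    """
    parts = PurePosixPath(path).parts[:-1]  # drop the file name
    return "/".join(parts[:depth]) if parts else "."


def abstract_pairs(pairs, depth=1):
    """Count clone pairs between directories at the given depth.

    Each pair is treated as unordered, so (a, b) and (b, a) are
    counted under the same key.
    """
    counts = Counter()
    for left, right in pairs:
        key = tuple(sorted((lift(left, depth), lift(right, depth))))
        counts[key] += 1
    return counts


# Made-up sample data, for illustration only.
pairs = [
    ("drivers/net/a.c", "drivers/net/b.c"),
    ("drivers/net/a.c", "drivers/scsi/c.c"),
    ("fs/ext2/inode.c", "fs/ext3/inode.c"),
    ("fs/ext2/dir.c", "drivers/scsi/c.c"),
]

summary = abstract_pairs(pairs, depth=1)   # top-level directories
detailed = abstract_pairs(pairs, depth=2)  # one level deeper
```

Raising `depth` moves the view from a coarse summary toward the individual files, which is one simple way to explore the same clone data at more than one abstraction level.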
