Visualization and analysis of software clones

Code clones are identical or similar fragments of code in a software system. The primary reasons behind code cloning are copy-paste programming practices, the reuse of existing code fragments instead of implementing from scratch, and the limitations of both programming languages and developers. Despite the maintenance implications of clones, it is not possible to conclude that cloning is harmful, because clones also have benefits (e.g., faster and independent development). As a result, researchers at least agree that clones need to be analyzed before they are aggressively refactored. Although a large number of state-of-the-art clone detectors are available today, handling raw clone data is challenging because of its textual nature and large volume. To address this issue, we propose a framework for large-scale clone analysis and develop a maintenance support environment based on the framework, called VisCad. To manage the large volume of clone data, VisCad employs the Visual Information Seeking Mantra: overview first, zoom and filter, then details-on-demand. With VisCad, users can analyze and identify distinctive code clones through a set of visualization techniques, metrics covering different clone relations, and data filtering operations. The loosely coupled architecture of VisCad allows users to work with any clone detection tool that reports the source coordinates of the clones it finds. Users can therefore work with the clone detectors of their choice, which is important because each clone detector has its own strengths and weaknesses. In addition, we extend the support for clone evolution analysis, which is important for understanding the cause and effect of changes at the clone level during the evolution of a software system. Such information can be used to make software maintenance decisions, such as when to refactor clones.
We propose and implement a set of visualizations that allow users to analyze the evolution of clones from a coarse-grained to a fine-grained level. Finally, we use VisCad to extract both spatial and temporal clone data to predict changes to clones in a future release or revision of the software; these predictions can be used to rank clone classes as another means of handling a large volume of clone data. We believe that VisCad makes clone comprehension easier and that it can be used as a test-bed to further explore code cloning, which is necessary for building a successful clone management system.
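
As a rough illustration of the "overview first, zoom and filter, then details-on-demand" workflow over detector output given as source coordinates, the following minimal Python sketch may help. It is not VisCad's implementation: the `CloneFragment` record, the function names, and the three metrics are hypothetical and chosen only for illustration, and the sketch assumes the detector reports a clone class identifier, a file path, and a start and end line for each fragment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CloneFragment:
    """One cloned fragment, identified only by its source coordinates (assumed format)."""
    class_id: int      # clone class the detector assigned this fragment to
    file: str          # path of the file containing the fragment
    start_line: int    # first line of the fragment (inclusive)
    end_line: int      # last line of the fragment (inclusive)


def group_by_class(fragments):
    """Group fragments into clone classes keyed by class id."""
    classes = defaultdict(list)
    for frag in fragments:
        classes[frag.class_id].append(frag)
    return dict(classes)


def overview(classes):
    """Overview: a few illustrative per-class metrics."""
    return {
        cid: {
            "fragments": len(frags),
            "files": len({f.file for f in frags}),
            "lines": sum(f.end_line - f.start_line + 1 for f in frags),
        }
        for cid, frags in classes.items()
    }


def filter_classes(summary, min_files=2):
    """Zoom and filter: keep classes whose fragments span at least min_files files."""
    return [cid for cid, m in summary.items() if m["files"] >= min_files]


def details(classes, cid):
    """Details-on-demand: the fragments of one class, ordered by location."""
    return sorted(classes[cid], key=lambda f: (f.file, f.start_line))


if __name__ == "__main__":
    # Made-up coordinates standing in for a detector's report.
    fragments = [
        CloneFragment(1, "a.c", 10, 20),
        CloneFragment(1, "b.c", 5, 15),
        CloneFragment(2, "a.c", 30, 35),
        CloneFragment(2, "a.c", 50, 55),
    ]
    classes = group_by_class(fragments)
    summary = overview(classes)
    for cid in filter_classes(summary, min_files=2):
        print(cid, summary[cid], details(classes, cid))
```

Because the sketch depends only on file paths and line ranges, any detector whose report can be mapped to such records could feed it, which is the kind of loose coupling the abstract describes.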
