Practical language-independent detection of near-miss clones

Previous research shows that most software systems contain significant amounts of duplicated, or cloned, code. Some clones are exact duplicates of each other, while others differ only in small details. We call these almost-perfect clones "near-miss" clones. Although detecting near-miss clones is technically difficult, it has many benefits, both academic and practical. Finding these clones gives better insight into how developers maintain and reuse code, and near-miss clones can be parameterized and removed to reduce overall source code size and system complexity. This paper presents a simple, general, and practical way to detect near-miss clones and summarizes the results of applying it to two production websites. We use standard lexical comparison tools coupled with language-specific extractors to locate potential clones. Our approach separates code comparison from code understanding, which makes the comparisons language independent and the approach easy to adapt to different programming languages.
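The separation between extraction and comparison can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a placeholder extractor that treats blank-line-separated blocks as candidate fragments, and it uses Python's `difflib` as a stand-in for a standard line-based comparison tool. The function names and the 0.7 threshold are hypothetical choices made for illustration only.

```python
import difflib
from itertools import combinations


def extract_fragments(source: str) -> list[list[str]]:
    """Placeholder extractor (an assumption, not the paper's extractor).

    Splits the source into blank-line-separated blocks and normalizes
    whitespace, so that the comparison step sees only plain lines of text.
    A real extractor would be language specific.
    """
    fragments: list[list[str]] = []
    current: list[str] = []
    for line in source.splitlines():
        normalized = " ".join(line.split())
        if normalized:
            current.append(normalized)
        elif current:
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments


def similarity(a: list[str], b: list[str]) -> float:
    """Line-level similarity in [0, 1]; knows nothing about the language."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def find_near_miss_clones(fragments, threshold=0.7):
    """Return (i, j, score) for each fragment pair at or above the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(fragments), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

In this sketch only `extract_fragments` would need to change to support another language; the comparison step operates on normalized lines and is unaware of syntax. A score of 1.0 corresponds to an exact clone, and scores between the threshold and 1.0 are near-miss candidates.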
