Large-Scale Inter-System Clone Detection Using Suffix Trees

Detecting license violations in source code requires comparing a suspected system against a very large corpus of source code, for instance the Debian source distribution. Techniques for detecting suspiciously similar code must therefore scale in terms of the resources they need. High precision is also necessary, because a human has to inspect the results. The current approach to the resource challenge is to create an index for the corpus, against which the suspected source code is then compared. Index creation, however, is very costly, and if the analysis is done only once it may not be worth the effort. This paper demonstrates how suffix trees can be used to obtain a scalable comparison. Our evaluation shows that this approach is faster than current index-based techniques. In addition, the paper proposes a method to improve precision through user feedback and automated data mining.
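As a rough illustration of a suffix-structure comparison that needs no corpus index, the following sketch builds a structure over the tokens of the suspected system and then streams corpus tokens against it. It is not the paper's implementation. Several things here are assumptions: that the structure is built over the smaller, suspected system; that a naive, uncompressed suffix trie (quadratic in time and space) may stand in for a linear-time suffix tree; and the names `build_suffix_trie`, `find_shared_runs`, and `min_len`, which are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_suffix_trie(tokens: Sequence[str]) -> dict:
    """Build a naive suffix trie over a token sequence.

    Assumption: this O(n^2) trie is only a stand-in for a real
    suffix tree, which would be built in linear time and space.
    """
    root: dict = {}
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:]:
            # Descend along the suffix, creating nodes as needed.
            node = node.setdefault(tok, {})
    return root


def find_shared_runs(
    trie: dict, corpus_tokens: Sequence[str], min_len: int
) -> List[Tuple[int, int]]:
    """Return (corpus_start, length) for token runs of at least
    min_len that also occur in the system the trie was built from."""
    hits: List[Tuple[int, int]] = []
    prev_end = 0  # end offset of the last reported run
    for start in range(len(corpus_tokens)):
        node = trie
        length = 0
        # Follow corpus tokens through the trie as far as they match.
        while start + length < len(corpus_tokens):
            nxt = node.get(corpus_tokens[start + length])
            if nxt is None:
                break
            node = nxt
            length += 1
        end = start + length
        # Skip runs that lie entirely inside the previously reported one.
        if length >= min_len and end > prev_end:
            hits.append((start, length))
            prev_end = end
    return hits


suspect = "a b c d e".split()
corpus_file = "x b c d y a b c d e z".split()
trie = build_suffix_trie(suspect)
print(find_shared_runs(trie, corpus_file, min_len=3))
```

In this sketch only the suspected system is preprocessed, and each corpus file is scanned once against the finished structure, so no index over the corpus has to be built or stored.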
