Finding repeated strings in code repositories and its applications to code-clone detection

Although researchers have created many advanced code-clone detection techniques, more effort is required before these techniques are widely adopted in industry. One reason is that the advanced techniques rely on lexing and parsing programs. Modern programming languages have complex lexical conventions and grammars, which evolve constantly, so keeping such a detector working requires substantial and continuous effort. This paper proposes a lightweight, language-independent method that detects code clones by simply finding repeated strings in a code repository, relying on neither lexing nor parsing. The proposed method is based on an efficient technique for finding repeated strings that was developed in a bioinformatics context. We refer to the repeated strings in the source code as weak Type-1 clones. Because the proposed technique normalizes each run of newlines, tabs, and spaces into a single space, it can find clones in which newline positions or indentation have changed, as is often the case when code is copied and pasted. Although the proposed method finds only verbatim copies, it also yields interesting observations regarding repository structures. Many developers may prefer the proposed simple approach because it is easier to understand than other advanced techniques that use heuristics, approximation, and machine learning.
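
The two steps described above, whitespace normalization followed by repeated-string finding, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`normalize`, `find_repeats`), the `min_len` threshold, and the naive suffix sorting are assumptions made for illustration, and an efficient repeat finder of the kind the abstract refers to would replace the quadratic sorting step.

```python
import re
from collections import defaultdict


def normalize(source: str) -> str:
    """Collapse every run of newlines, tabs, and spaces into one space."""
    return re.sub(r"\s+", " ", source).strip()


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def find_repeats(text: str, min_len: int = 10) -> dict[str, list[int]]:
    """Map each repeated substring found to the sorted offsets where it starts.

    Hypothetical helper: sorts suffixes naively, so it only suits small inputs.
    """
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    repeats = defaultdict(set)
    # Neighbours in sorted suffix order share the longest common prefixes.
    for i, j in zip(suffix_array, suffix_array[1:]):
        length = common_prefix_len(text[i:], text[j:])
        if length < min_len:
            continue
        # Skip pairs that also match one character to the left, so suffixes
        # of a longer repeat are not reported as separate repeats.
        if i > 0 and j > 0 and text[i - 1] == text[j - 1]:
            continue
        repeats[text[i:i + length]].update((i, j))
    return {s: sorted(offsets) for s, offsets in repeats.items()}


# The same two statements appear twice with different line breaks and indentation.
corpus = normalize("int x = 0;\n\tfoo(x);\nreturn;\nint x = 0; foo(x);\n")
for clone, offsets in find_repeats(corpus).items():
    print(repr(clone), offsets)
```

Because normalization runs first, the copy that spans two indented lines and the copy on a single line become identical strings, so the sketch reports them as one repeat. The left-extension check is a simplification that only compares neighbouring suffixes; it can miss some maximal repeats that a full algorithm would report.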
