LLVM-based code clone detection framework

Existing code clone detection methods have limitations. Textual and lexical approaches cannot detect strongly modified code fragments. Syntactic and metrics-based approaches detect strong modifications, but with low accuracy. In contrast, the semantic approach accurately detects both cloned fragments with small changes and strongly modified ones, but methods based on it do not scale to large projects. This paper describes an LLVM-based code clone detection framework that uses semantic analysis of programs. It has high accuracy and scales to millions of lines of source code. The tool embeds a testing system that automatically generates code clones for a project; this system is used to measure the accuracy of the developed algorithms. The tool is applicable to all languages that can be compiled to LLVM bitcode. The proposed method was compared with two widely used tools, MOSS and CloneDR, and the results show that it has higher accuracy. The tool scales to the Linux-2.6 kernel, which has about fourteen million lines of source code.
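The limitation of textual matching mentioned above can be illustrated with a small sketch. It is not the framework's algorithm: it works on Python source with the standard `ast` module rather than on LLVM bitcode, and it only abstracts identifier names, which is far simpler than semantic analysis. The helper names (`textual_clone`, `structural_clone`, `normalized`) and the sample fragments are hypothetical and chosen only for illustration.

```python
import ast


class _Normalize(ast.NodeTransformer):
    """Replace each identifier with a placeholder based on first occurrence."""

    def __init__(self):
        self.names = {}

    def _alias(self, name):
        # The same original name always maps to the same placeholder.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node


def normalized(source):
    """Return a name-independent dump of the syntax tree (illustrative only)."""
    tree = _Normalize().visit(ast.parse(source))
    return ast.dump(tree)


def textual_clone(a, b):
    """Textual approach: fragments match only if their text is identical."""
    return a.strip() == b.strip()


def structural_clone(a, b):
    """Name-abstracted comparison: tolerates consistent renaming."""
    return normalized(a) == normalized(b)


# Hypothetical fragments: RENAMED is ORIGINAL with every identifier renamed.
ORIGINAL = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
RENAMED = (
    "def acc(values):\n"
    "    result = 0\n"
    "    for item in values:\n"
    "        result += item\n"
    "    return result\n"
)
DIFFERENT = "def total(xs):\n    return len(xs)\n"
```

Here `textual_clone(ORIGINAL, RENAMED)` is false while `structural_clone(ORIGINAL, RENAMED)` is true. A comparison of this kind still misses clones with reordered or added statements, which is the case the abstract attributes to the semantic approach.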
