TECCD: A Tree Embedding Approach for Code Clone Detection

Clone detection techniques have been explored for decades. Recently, deep learning techniques have been adopted to improve code representation and have advanced the state of the art in code clone detection. These approaches usually require transforming the AST into a binary tree to incorporate syntactic information, which introduces overhead. Moreover, they rely on term embedding, which requires large training datasets. In this paper, we introduce a tree embedding technique for clone detection. Our approach first performs tree embedding to obtain a node vector for each intermediate node in the AST, capturing the structural information of the AST. We then compose a tree vector from its constituent node vectors using a lightweight method. Lastly, Euclidean distances between tree vectors are measured to determine code clones. We implement our approach in a tool called TECCD and evaluate it on BigCloneBench (BCB) and 7 other large-scale Java projects. The results show that our approach achieves good accuracy and recall and outperforms existing approaches.
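
The last two steps of the pipeline (composing a tree vector from node vectors, then comparing tree vectors by Euclidean distance) can be illustrated with a minimal Python sketch. This is not the TECCD implementation: the abstract does not specify the lightweight composition method, so a plain element-wise mean is assumed here as a stand-in, and the function names `compose_tree_vector` and `is_clone`, the toy two-dimensional vectors, and the threshold of 0.5 are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def compose_tree_vector(node_vectors: Sequence[Sequence[float]]) -> Vector:
    """Combine the node vectors of one AST into a single tree vector.

    Assumption: an element-wise mean stands in for the paper's
    unspecified "lightweight method".
    """
    if not node_vectors:
        raise ValueError("need at least one node vector")
    dim = len(node_vectors[0])
    if any(len(v) != dim for v in node_vectors):
        raise ValueError("all node vectors must have the same dimension")
    count = len(node_vectors)
    return [sum(v[i] for v in node_vectors) / count for i in range(dim)]


def is_clone(tree_a: Sequence[float], tree_b: Sequence[float],
             threshold: float = 0.5) -> bool:
    """Report a clone pair when the Euclidean distance is within threshold.

    The threshold value is a hypothetical placeholder, not a tuned setting.
    """
    return math.dist(tree_a, tree_b) <= threshold


# Toy node vectors for two code fragments (made-up values for illustration).
fragment_a = compose_tree_vector([[1.0, 0.0], [0.0, 1.0]])
fragment_b = compose_tree_vector([[4.0, 0.0], [0.0, 4.0]])
print(is_clone(fragment_a, fragment_a), is_clone(fragment_a, fragment_b))
```

In a real setting the node vectors would come from the trained tree embedding, and the threshold would need to be chosen empirically on a labeled benchmark.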
