Neural Detection of Semantic Code Clones Via Tree-Based Convolution

Code clones are similar code fragments that share the same semantics but may differ syntactically to various degrees. Detecting code clones helps reduce the cost of software maintenance and prevent faults. Various approaches for detecting code clones have been proposed over the last two decades, but few of them can detect semantic clones, i.e., code clones with dissimilar syntax. Recent research has attempted to adopt deep learning for detecting code clones, for example by applying a tree-based LSTM over the Abstract Syntax Tree (AST). However, such approaches do not fully leverage the structural information of code fragments, which limits their clone-detection capability. To fully unleash the power of deep learning for detecting code clones, we propose a new approach that uses tree-based convolution to detect semantic clones by capturing both the structural information of a code fragment from its AST and the lexical information from its code tokens. Additionally, our approach addresses a limitation of existing models: because source code has an unlimited vocabulary of tokens, exploiting lexical information from code tokens is often ineffective when dealing with unseen tokens. In particular, we propose a new embedding technique called position-aware character embedding (PACE), which essentially treats any token as a position-weighted combination of character one-hot embeddings. Our experimental results show that our approach substantially outperforms an existing state-of-the-art approach, with increases of 0.42 and 0.15 in F1-score on two popular code-clone benchmarks (OJClone and BigCloneBench, respectively), while being more computationally efficient. Our experimental results also show that PACE enables our approach to be substantially more effective when code clones contain unseen tokens.
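
To make the idea behind PACE concrete, the following is a minimal sketch of a position-weighted combination of character one-hot vectors. The abstract does not state the weighting scheme or the character alphabet, so both are assumptions here: the weight decays linearly with character position, and `ALPHABET`, `CHAR_INDEX`, and `pace_embedding` are hypothetical names chosen for illustration, not the authors' implementation.

```python
import string
from typing import List

# Hypothetical character alphabet; the source does not specify one.
ALPHABET = string.ascii_lowercase + string.digits + "_"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def pace_embedding(token: str) -> List[float]:
    """Embed a token as a position-weighted sum of character one-hot vectors.

    Assumption: the character at 0-based position i of an n-character token
    gets weight (n - i) / n, so earlier characters count more.
    """
    vec = [0.0] * len(ALPHABET)
    # Lowercase the token and ignore characters outside the assumed alphabet.
    chars = [ch for ch in token.lower() if ch in CHAR_INDEX]
    n = len(chars)
    for i, ch in enumerate(chars):
        weight = (n - i) / n
        # Adding the weight at the character's index is equivalent to
        # adding weight * one_hot(ch).
        vec[CHAR_INDEX[ch]] += weight
    return vec


# Any token built from known characters gets a non-zero vector, even if the
# token itself never appeared in training data.
unseen = pace_embedding("tmp_buf42")
```

Because the vector is built from characters rather than looked up in a token vocabulary, an unseen identifier still receives a meaningful representation, and the position weights keep anagrams such as `ab` and `ba` distinct.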
