Modeling Functional Similarity in Source Code With Graph-Based Siamese Networks

Code clones are code fragments that share the same, or nearly the same, syntax or semantics. Code clone detection plays an important role in software maintenance, code refactoring, and reuse. A substantial amount of research has been conducted to detect clones. Most of these approaches rely on lexical and syntactic information, and only a few target semantic clones. Recently, motivated by the success of deep learning models in other fields, including natural language processing and computer vision, researchers have attempted to adopt deep learning techniques to detect code clones. These approaches use lexical information (tokens) and/or syntactic structures such as abstract syntax trees (ASTs). However, they do not make sufficient use of the available structural and semantic information, which limits their capabilities. This paper addresses the problem of semantic code clone detection using program dependency graphs and geometric neural networks, leveraging the structured syntactic and semantic information. We have developed a prototype tool, HOLMES, based on our novel approach and empirically evaluated it on popular code clone benchmarks. Our results show that HOLMES performs considerably better than TBCCD, another state-of-the-art tool. We also evaluated HOLMES on unseen projects and performed cross-dataset experiments to assess its generalizability. Our results affirm that HOLMES outperforms TBCCD, since most of the pairs that HOLMES detected were either undetected or suboptimally reported by TBCCD.
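The general idea of a graph-based Siamese network, as named in the title, can be illustrated with a small sketch. This is not the HOLMES architecture: the one-hot node features, the mean-aggregation message passing, the random untrained weights, and every function name below are assumptions made only for illustration. The sketch shows the one property that defines a Siamese setup, namely that both program graphs pass through the same encoder with the same weights before their embeddings are compared.

```python
import math
import random


def make_weights(dim, rng):
    """Random square weight matrix (untrained; an assumption for illustration)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]


def matvec(vec, weights):
    """Multiply a row vector by a (len(vec) x out_dim) weight matrix."""
    out_dim = len(weights[0])
    return [sum(vec[i] * weights[i][j] for i in range(len(vec)))
            for j in range(out_dim)]


def encode_graph(node_features, edges, weights, rounds=2):
    """Embed a graph: mean-aggregate each node with its predecessors,
    apply a shared linear map plus tanh, then mean-pool over all nodes."""
    n = len(node_features)
    predecessors = {i: [] for i in range(n)}
    for src, dst in edges:
        predecessors[dst].append(src)  # messages flow along dependence edges
    h = [list(f) for f in node_features]
    for _ in range(rounds):
        new_h = []
        for i in range(n):
            group = [h[i]] + [h[j] for j in predecessors[i]]
            mean = [sum(col) / len(group) for col in zip(*group)]
            new_h.append([math.tanh(x) for x in matvec(mean, weights)])
        h = new_h
    return [sum(col) / n for col in zip(*h)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def siamese_similarity(graph_a, graph_b, weights):
    """Encode both graphs with the SAME weights and compare the embeddings."""
    return cosine(encode_graph(*graph_a, weights),
                  encode_graph(*graph_b, weights))


# Hypothetical node types, one-hot encoded: assignment, condition, return.
ASSIGN, COND, RET = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

# Two toy dependence graphs given as (node_features, edges).
graph_a = ([ASSIGN, COND, RET], [(0, 1), (1, 2), (0, 2)])
graph_b = ([ASSIGN, ASSIGN, RET], [(0, 2), (1, 2)])

shared_weights = make_weights(3, random.Random(0))
print("a vs a:", siamese_similarity(graph_a, graph_a, shared_weights))
print("a vs b:", siamese_similarity(graph_a, graph_b, shared_weights))
```

In a trained system the shared weights would be learned from labelled clone and non-clone pairs, and the graphs would come from a program analysis tool rather than being written by hand. Mean pooling is used here because it makes the embedding independent of node ordering.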
