MISIM: An End-to-End Neural Code Similarity System

Code similarity systems are integral to a range of applications, from code recommendation to automated construction of software tests and defect mitigation. In this paper, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware similarity structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural code similarity scoring system, which can be implemented with various neural network algorithms and topologies whose parameters are learned. We compare MISIM to three other state-of-the-art code similarity systems: (i) code2vec, (ii) Neural Code Comprehension, and (iii) Aroma. In our experimental evaluation across 45,780 programs, MISIM consistently outperformed all three systems, often by a large factor (upwards of 40.6x).
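To make the two-component design concrete, the following is a minimal sketch of a similarity scorer, not the MISIM implementation. It assumes that each program has already been mapped to a fixed-length vector and that the score is the cosine similarity of two such vectors. The `bag_of_features` encoder is a hypothetical stand-in for a learned neural encoder, and the token vocabulary is invented for illustration.

```python
import math


def bag_of_features(tokens, vocab):
    """Hypothetical stand-in for a learned encoder: count vocabulary hits."""
    return [float(tokens.count(term)) for term in vocab]


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # An all-zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


# Illustrative vocabulary and token streams (assumptions, not from the paper).
VOCAB = ["for", "while", "if", "return", "+", "*"]
loop_sum = ["for", "+", "return"]
loop_sum_variant = ["for", "+", "+", "return"]
branch_product = ["if", "*", "return"]

score_close = cosine_similarity(
    bag_of_features(loop_sum, VOCAB), bag_of_features(loop_sum_variant, VOCAB)
)
score_far = cosine_similarity(
    bag_of_features(loop_sum, VOCAB), bag_of_features(branch_product, VOCAB)
)
```

In this toy setup, the two loop-and-add programs score higher than the loop-and-add versus branch-and-multiply pair. A learned encoder would replace the token counts with vectors trained so that semantically similar programs lie close together.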
