Paper Information - MISIM: An End-to-End Neural Code Similarity System

Code similarity systems are integral to a range of applications from code recommendation to automated construction of software tests and defect mitigation. In this paper, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware similarity structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural-based code similarity scoring system, which can be implemented with various neural network algorithms and topologies with learned parameters. We compare MISIM to three other state-of-the-art code similarity systems: (i) code2vec, (ii) Neural Code Comprehension, and (iii) Aroma. In our experimental evaluation across 45,780 programs, MISIM consistently outperformed all three systems, often by a large factor (upwards of 40.6x).
