Commit2Vec: Learning Distributed Representations of Code Changes

Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis too, due to the enormous amount of freely available source code (e.g., from open-source software repositories). In this work, we elaborate upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and we adapt it to represent source changes (i.e., commits). We use this representation to classify security-relevant commits. Because our method uses transfer learning (that is, we train a network on a "pretext task" for which abundant labeled data is available, and then we use such network for the target task of commit classification, for which fewer labeled instances are available), we studied the impact of pre-training the network using two different pretext tasks versus a randomly initialized model. Our results indicate that representations that leverage the structural information obtained through code syntax outperform token-based representations. Furthermore, the performance metrics obtained when pre-training on a loosely related pretext task with a very large dataset ($>10^6$ samples) were surpassed when pretraining on a smaller dataset ($>10^4$ samples) but for a pretext task that is more closely related to the target task.

[1]  Cícero Nogueira dos Santos,et al.  Learning Character-level Representations for Part-of-Speech Tagging , 2014, ICML.

[2]  Charles A. Sutton,et al.  Suggesting accurate method and class names , 2015, ESEC/SIGSOFT FSE.

[3]  Premkumar T. Devanbu,et al.  A Survey of Machine Learning for Big Code and Naturalness , 2017, ACM Comput. Surv..

[4]  Tao Wang,et al.  Convolutional Neural Networks over Tree Structures for Programming Language Processing , 2014, AAAI.

[5]  Omer Levy,et al.  code2seq: Generating Sequences from Structured Representations of Code , 2018, ICLR.

[6]  Michele Bezzi,et al.  A Practical Approach to the Automatic Classification of Security-Relevant Commits , 2018, 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[7]  Bin Wang,et al.  Detecting Copy Directions among Programs Using Extreme Learning Machines , 2015 .

[8]  Martin Monperrus,et al.  A Literature Study of Embeddings on Source Code , 2019, ArXiv.

[9]  Nicholas Ayache,et al.  Fine-tuned convolutional neural nets for cardiac MRI acquisition plane recognition , 2017, Comput. methods Biomech. Biomed. Eng. Imaging Vis..

[10]  Michele Bezzi,et al.  A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software , 2019, 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR).

[11]  Yaqin Zhou,et al.  Automated identification of security issues from commit messages and bug reports , 2017, ESEC/SIGSOFT FSE.

[12]  Zhendong Su,et al.  On the naturalness of software , 2012, ICSE 2012.

[13]  Harald C. Gall,et al.  Change Distilling:Tree Differencing for Fine-Grained Source Code Change Extraction , 2007, IEEE Transactions on Software Engineering.

[14]  Charles A. Sutton,et al.  A Convolutional Attention Network for Extreme Summarization of Source Code , 2016, ICML.

[15]  Uri Alon,et al.  code2vec: learning distributed representations of code , 2018, Proc. ACM Program. Lang..

[16]  Graham Neubig,et al.  Learning to Represent Edits , 2018, ICLR.

[17]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[18]  Eran Yahav,et al.  Code completion with statistical language models , 2014, PLDI.