InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees

Deep learning models of source code have been applied successfully to many software engineering tasks, such as code search, code comment generation, bug detection, and code migration. Current learning techniques, however, have a major drawback: the models are mostly trained on datasets labeled for particular downstream tasks, so the resulting code representations may not be suitable for other tasks. While some techniques produce representations from unlabeled code, they are far from satisfactory when applied to downstream tasks. This paper proposes InferCode to overcome this limitation by adapting the self-supervised learning mechanism to build a source code model. The key novelty lies in training code representations by predicting automatically identified subtrees from the context of the abstract syntax trees (ASTs). InferCode treats subtrees in ASTs as the labels for training code representations, without any human labeling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream task or code unit. We trained an InferCode model instance using the Tree-based CNN as the encoder on a large set of Java code. We then applied it to downstream unsupervised tasks such as code clustering, code clone detection, and cross-language code search, and reused it under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, and ASTNN, our pre-trained InferCode model achieves higher performance by a significant margin on most tasks, including those involving different programming languages.
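
To make the idea of using subtrees as training labels concrete, the following is a minimal sketch and not the InferCode implementation. It assumes Python's standard `ast` module as a stand-in parser (the abstract describes training on Java code), and the set of node types, the `serialize` format, and the helper names `subtree_labels` and `build_vocabulary` are all hypothetical choices made for illustration. It only shows how pseudo-labels could be derived from unlabeled code; the encoder and the prediction objective are not shown.

```python
import ast
from collections import Counter
from typing import Dict, Iterable, List

# Assumption: subtrees rooted at these node types act as pseudo-labels.
# The actual selection criteria used by InferCode are not given in the abstract.
LABEL_NODE_TYPES = (ast.Assign, ast.Return, ast.Call, ast.If, ast.For)


def serialize(node: ast.AST) -> str:
    """Render a subtree as a nested string of node type names (tokens ignored)."""
    children = [serialize(child) for child in ast.iter_child_nodes(node)]
    name = type(node).__name__
    return f"{name}({','.join(children)})" if children else name


def subtree_labels(source: str) -> List[str]:
    """Collect serialized subtrees of one code snippet; no human labeling needed."""
    tree = ast.parse(source)
    return [serialize(n) for n in ast.walk(tree) if isinstance(n, LABEL_NODE_TYPES)]


def build_vocabulary(snippets: Iterable[str]) -> Dict[str, int]:
    """Map each distinct subtree string to an integer id, most frequent first."""
    counts = Counter(label for s in snippets for label in subtree_labels(s))
    return {label: idx for idx, (label, _) in enumerate(counts.most_common())}


if __name__ == "__main__":
    corpus = ["x = 1", "def f(a):\n    return g(a)"]
    vocab = build_vocabulary(corpus)
    for snippet in corpus:
        # Each (snippet, subtree id) pair could serve as one training example.
        print([vocab[label] for label in subtree_labels(snippet)])
```

In a self-supervised setup of this kind, an encoder would map each snippet to a vector and be trained to predict which subtree ids occur in it, so structurally similar snippets end up with similar vectors.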
