Assessing the Generalizability of Code2vec Token Embeddings

Many Natural Language Processing (NLP) tasks, such as sentiment analysis or syntactic parsing, have benefited from the development of word embedding models. In particular, regardless of the training algorithms, the learned embeddings have often been shown to be generalizable to different NLP tasks. In contrast, despite recent momentum on word embeddings for source code, the literature lacks evidence of their generalizability beyond the example task they have been trained for. In this experience paper, we identify 3 potential downstream tasks, namely code comments generation, code authorship identification, and code clones detection, that source code token embedding models can be applied to. We empirically assess a recently proposed code token embedding model, namely code2vec's token embeddings. Code2vec was trained on the task of predicting method names, and while there is potential for using the vectors it learns on other tasks, it has not been explored in literature. Therefore, we fill this gap by focusing on its generalizability for the tasks we have identified. Eventually, we show that source code token embeddings cannot be readily leveraged for the downstream tasks. Our experiments even show that our attempts to use them do not result in any improvements over less sophisticated methods. We call for more research into effective and general use of code embeddings.

[1]  Cristina V. Lopes,et al.  SourcererCC: Scaling Code Clone Detection to Big-Code , 2015, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[2]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[3]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[4]  Aziz Mohaisen,et al.  Large-Scale and Language-Oblivious Code Authorship Identification , 2018, CCS.

[5]  Charles A. Sutton,et al.  Suggesting accurate method and class names , 2015, ESEC/SIGSOFT FSE.

[6]  Uri Alon,et al.  A general path-based representation for predicting program properties , 2018, PLDI.

[7]  Deborah A. Coughlin,et al.  Correlating automated and human assessments of machine translation quality , 2003, MTSUMMIT.

[8]  Xiaodong Gu,et al.  Deep Code Search , 2018, 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).

[9]  Martin Monperrus,et al.  A Literature Study of Embeddings on Source Code , 2019, ArXiv.

[10]  Alvin Cheung,et al.  Summarizing Source Code using a Neural Attention Model , 2016, ACL.

[11]  Uri Alon,et al.  code2vec: learning distributed representations of code , 2018, Proc. ACM Program. Lang..

[12]  Yulia Tsvetkov,et al.  Problems With Evaluation of Word Embeddings Using Word Similarity Tasks , 2016, RepEval@ACL.

[13]  Lerina Aversano,et al.  An empirical study on the maintenance of source code clones , 2010, Empirical Software Engineering.

[14]  Martin White,et al.  Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities , 2017, 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER).

[15]  Pushmeet Kohli,et al.  Semantic Code Repair using Neuro-Symbolic Transformation Networks , 2017, ICLR 2018.

[16]  Foutse Khomh,et al.  Late propagation in software clones , 2011, 2011 27th IEEE International Conference on Software Maintenance (ICSM).

[17]  Chanchal Kumar Roy,et al.  Towards a Big Data Curated Benchmark of Inter-project Code Clones , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[18]  Xiaodong Gu,et al.  Deep API learning , 2016, SIGSOFT FSE.

[19]  Mario Linares Vásquez,et al.  ChangeScribe: A Tool for Automatically Generating Commit Messages , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[20]  Tim Menzies,et al.  Easy over hard: a case study on deep learning , 2017, ESEC/SIGSOFT FSE.

[21]  Marc Brockschmidt,et al.  Learning to Represent Programs with Graphs , 2017, ICLR.

[22]  Zhenchang Xing,et al.  Neural-Machine-Translation-Based Commit Message Generation: How Far Are We? , 2018, 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[23]  Xiaodong Gu,et al.  DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning , 2017, IJCAI.

[24]  Tom Van Cutsem,et al.  Import2vec: Learning Embeddings for Software Libraries , 2019, 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR).

[25]  Koushik Sen,et al.  DeepBugs: a learning approach to name-based bug detection , 2018, Proc. ACM Program. Lang..

[26]  Nicholas A. Kraft,et al.  Exploring the use of deep learning for feature location , 2015, 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[27]  Emily Hill,et al.  Towards automatically generating summary comments for Java methods , 2010, ASE.

[28]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[29]  David Lo,et al.  Deep Code Comment Generation , 2018, 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC).

[30]  Arvind Narayanan,et al.  De-anonymizing Programmers via Code Stylometry , 2015, USENIX Security Symposium.

[31]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[32]  Collin McMillan,et al.  Automatically generating commit messages from diffs using neural machine translation , 2017, 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[33]  Yoshua Bengio,et al.  Word Representations: A Simple and General Method for Semi-Supervised Learning , 2010, ACL.

[34]  Alexander M. Rush,et al.  OpenNMT: Open-Source Toolkit for Neural Machine Translation , 2017, ACL.

[35]  Premkumar T. Devanbu,et al.  Are deep neural networks the best choice for modeling source code? , 2017, ESEC/SIGSOFT FSE.

[36]  Le Song,et al.  Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection , 2018 .

[37]  Premkumar T. Devanbu,et al.  On the naturalness of software , 2016, Commun. ACM.

[38]  Siau-Cheng Khoo,et al.  A discriminative model approach for accurate duplicate bug report retrieval , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[39]  Shuvendu K. Lahiri,et al.  Code vectors: understanding programs through embedded abstracted symbolic traces , 2018, ESEC/SIGSOFT FSE.

[40]  Diomidis Spinellis,et al.  Semantic Source Code Models Using Identifier Embeddings , 2019, 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR).

[41]  Chanchal Kumar Roy,et al.  BigCloneEval: A Clone Detection Tool Evaluation Framework with BigCloneBench , 2016, 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[42]  Christian Bird,et al.  Deep learning type inference , 2018, ESEC/SIGSOFT FSE.

[43]  Thorsten Joachims,et al.  Evaluation methods for unsupervised word embeddings , 2015, EMNLP.

[44]  Efstathios Stamatatos,et al.  A survey of modern authorship attribution methods , 2009, J. Assoc. Inf. Sci. Technol..

[45]  Leonidas J. Guibas,et al.  Learning Program Embeddings to Propagate Feedback on Student Code , 2015, ICML.

[46]  Ettore Merlo,et al.  Experiment on the automatic detection of function clones in a software system using metrics , 1996, 1996 Proceedings of International Conference on Software Maintenance.

[47]  Yijun Yu,et al.  Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks , 2017, AAAI Workshops.

[48]  Yves Le Traon,et al.  Learning to Spot and Refactor Inconsistent Method Names , 2019, 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE).

[49]  Cristina V. Lopes,et al.  Oreo: detection of clones in the twilight zone , 2018, ESEC/SIGSOFT FSE.

[50]  Andreas Krause,et al.  Predicting Program Properties from "Big Code" , 2015, POPL.

[51]  Andrian Marcus,et al.  Supporting program comprehension with source code summarization , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[52]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[53]  Ignacio Iacobacci,et al.  Embeddings for Word Sense Disambiguation: An Evaluation Study , 2016, ACL.

[54]  Aditya V. Thakur,et al.  Path-Based Function Embedding and its Application to Specification Mining , 2018, ArXiv.

[55]  I-Han Hsiao,et al.  user2code2vec: Embeddings for Profiling Students Based on Distributional Representations of Source Code , 2019, LAK.

[56]  Krzysztof Czarnecki,et al.  An Exploratory Study of Cloning in Industrial Software Product Lines , 2013, 2013 17th European Conference on Software Maintenance and Reengineering.

[57]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[58]  Ming Li,et al.  Positive and Unlabeled Learning for Detecting Software Functional Clones with Adversarial Training , 2018, IJCAI.

[59]  Sampo Pyysalo,et al.  Intrinsic Evaluation of Word Vectors Fails to Predict Extrinsic Performance , 2016, RepEval@ACL.

[60]  Chanchal Kumar Roy,et al.  Mining Duplicate Questions of Stack Overflow , 2016, 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR).

[61]  Chanchal Kumar Roy,et al.  Evaluating clone detection tools with BigCloneBench , 2015, 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[62]  Torsten Hoefler,et al.  Neural Code Comprehension: A Learnable Representation of Code Semantics , 2018, NeurIPS.

[63]  Artur Andrzejak,et al.  Learning-Based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection , 2019, 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER).

[64]  Tao Wang,et al.  Convolutional Neural Networks over Tree Structures for Programming Language Processing , 2014, AAAI.