IdBench: Evaluating Semantic Representations of Identifier Names in Source Code

Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of name-based analyses are semantic representations of identifiers, e.g., in the form of learned embeddings. The high-level goal of such a representation is to encode whether two identifiers, e.g., len and size, are semantically similar. Unfortunately, it is currently unclear to what extent semantic representations match the semantic relatedness and similarity perceived by developers. This paper presents IdBench, the first benchmark for evaluating semantic representations against a ground truth created from thousands of ratings by 500 software developers. We use IdBench to study state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions. Our results show that the effectiveness of semantic representations varies significantly and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing technique provides a satisfactory representation of semantic similarities, among other reasons because identifiers with opposing meanings are incorrectly considered to be similar, which may lead to fatal mistakes, e.g., in a refactoring tool. Studying the strengths and weaknesses of the different techniques shows that they complement each other. As a first step toward exploiting this complementarity, we present an ensemble model that combines existing techniques and that clearly outperforms the best available semantic representation.
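To make the evaluation idea concrete, the following minimal sketch scores one lexical string distance function against developer-style ratings. It is an illustration under stated assumptions, not the benchmark's actual procedure: the abstract does not say which agreement measure is used, so Spearman rank correlation is assumed here, and the identifier pairs and ratings are invented toy values rather than IdBench data. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def lexical_similarity(a, b):
    """Map edit distance to a similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented toy ratings (NOT from IdBench): (identifier, identifier, rating).
toy_pairs = [
    ("len", "size", 0.9),
    ("index", "idx", 0.95),
    ("rows", "cols", 0.3),
    ("minimum", "maximum", 0.1),
]

predicted = [lexical_similarity(a, b) for a, b, _ in toy_pairs]
human = [rating for _, _, rating in toy_pairs]
print("Spearman correlation:", spearman(predicted, human))
```

A learned embedding could be evaluated the same way by swapping `lexical_similarity` for, say, cosine similarity between identifier vectors. Pairs such as `minimum` and `maximum` show why purely lexical measures can rate opposites as close.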
