Can Identifier Splitting Improve Open-Vocabulary Language Model of Code?

Statistical language models of source code have successfully assisted software engineering tasks. However, developers can create or pick arbitrary identifiers when writing source code. Freely chosen identifiers lead to the notorious out-of-vocabulary (OOV) problem, which negatively affects model performance. Recently, Karampatsis et al. showed that using the Byte Pair Encoding (BPE) algorithm to address the OOV problem can improve language models’ predictive performance on source code. However, a drawback of BPE is that it cannot split identifiers in a way that preserves their meaning. Prior research also shows that splitting compound identifiers into sub-words that reflect their semantics can benefit software development tools. These two facts motivate us to explore whether identifier splitting techniques can be used to augment the BPE algorithm and boost the performance of the open-vocabulary language models considered in Karampatsis et al.’s work. This paper proposes splitting identifiers both when constructing the vocabulary and when processing model inputs, which gives three different settings for applying identifier splitting to language models for the code completion task. We contrast the models’ performance under these settings and find that simply inserting identifier splitting into the pipeline hurts model performance, while a hybrid strategy combining identifier splitting and the BPE algorithm can outperform the original open-vocabulary models on predicting identifiers by 3.68% in recall and 6.32% in Mean Reciprocal Rank. The results also show that the hybrid strategy can improve the entropy of language models by 2.02%.
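
To make the notion of identifier splitting concrete, the following is a minimal sketch of a convention-based splitter that breaks compound identifiers at underscores, camelCase boundaries, and digit runs. It is an illustration only: the function name `split_identifier` and the regular expression are assumptions made here, not the splitting technique or the hybrid strategy evaluated in the paper.

```python
import re
from typing import List

# Hypothetical pattern (an assumption for illustration, not the paper's method).
# Alternatives are tried left to right at each position:
_SUBWORD = re.compile(
    r"[A-Z]+(?=[A-Z][a-z])"  # acronym directly before a capitalized word, e.g. "HTTP" in "HTTPServer"
    r"|[A-Z]?[a-z]+"         # a lowercase run, optionally starting with one capital
    r"|[A-Z]+"               # a remaining run of capitals, e.g. "ID"
    r"|\d+"                  # a run of digits
)


def split_identifier(identifier: str) -> List[str]:
    """Split a compound identifier into sub-words.

    Underscores and other separator characters match none of the
    alternatives above, so they are simply skipped.
    """
    return _SUBWORD.findall(identifier)


if __name__ == "__main__":
    for name in ["parseHTTPResponse", "max_line_count", "utf8Decoder"]:
        print(name, "->", split_identifier(name))
```

A splitter of this kind produces sub-words that follow naming conventions, whereas BPE merges are driven purely by symbol frequency; how the two are combined in the hybrid setting is not specified in this abstract.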

[1] Harald C. Gall, et al. When Code Completion Fails: A Case Study on Real-World Completions, 2019, 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE).

[2] Giuliano Antoniol, et al. Can Better Identifier Splitting Techniques Help Feature Location?, 2011, 2011 IEEE 19th International Conference on Program Comprehension.

[3] Premkumar T. Devanbu, et al. Recovering clear, natural identifiers from obfuscated JS names, 2017, ESEC/SIGSOFT FSE.

[4] Denys Poshyvanyk, et al. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair, 2018, IEEE Transactions on Software Engineering.

[5] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.

[6] Michael Hucka, et al. Spiral: splitters for identifiers in source code files, 2018, J. Open Source Softw.

[7] David W. Binkley, et al. To camelcase or under_score, 2009, 2009 IEEE 17th International Conference on Program Comprehension.

[8] David Lo, et al. Query expansion via WordNet for effective code search, 2015, 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[9] Zaixiang Zheng, et al. Vocabulary Learning via Optimal Transport for Neural Machine Translation, 2021, ACL/IJCNLP.

[10] David Lo, et al. IncBL: Incremental Bug Localization, 2021, 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[11] Premkumar T. Devanbu, et al. A Survey of Machine Learning for Big Code and Naturalness, 2017, ACM Comput. Surv.

[12] Omer Levy, et al. code2seq: Generating Sequences from Structured Representations of Code, 2018, ICLR.

[13] David Lo, et al. Modeling Functional Similarity in Source Code With Graph-Based Siamese Networks, 2020, IEEE Transactions on Software Engineering.

[14] Emily Hill, et al. Automatically capturing source code context of NL-queries for software maintenance and reuse, 2009, 2009 IEEE 31st International Conference on Software Engineering.

[15] Emily Hill, et al. Mining source code to automatically split identifiers for software analysis, 2009, 2009 6th IEEE International Working Conference on Mining Software Repositories.

[16] Charles A. Sutton, et al. Suggesting accurate method and class names, 2015, ESEC/SIGSOFT FSE.

[17] David Lo, et al. Version history, similar report, and structure: putting them together for improved bug localization, 2014, ICPC 2014.

[18] Eran Yahav, et al. Code completion with statistical language models, 2014, PLDI.

[19] Emily Hill, et al. An empirical study of identifier splitting techniques, 2014, Empirical Software Engineering.

[20] Andrea Janes, et al. Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code, 2020, 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE).

[21] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.

[22] Premkumar T. Devanbu, et al. On the localness of software, 2014, SIGSOFT FSE.