Learning Autocompletion from Real-World Datasets

Code completion is a popular software development tool integrated into all major IDEs. Many neural language models have achieved promising results in completion suggestion prediction on synthetic benchmarks. However, a recent study, "When Code Completion Fails: A Case Study on Real-World Completions" [8], demonstrates that these results may not translate into real-world performance gains. To address this gap, we train models on real-world code completion examples and find that they outperform models trained on committed source code and working version snapshots by 12.8% and 13.8% accuracy, respectively. We observe this improvement across modeling technologies, and A/B testing shows that it corresponds to a 6.2% increase in programmers' actual autocompletion usage. Furthermore, we characterize a large corpus of logged autocompletion usages to investigate why training on real-world examples yields stronger models.

[1]  Premkumar T. Devanbu, et al.  On the naturalness of software, 2016, Commun. ACM.

[2]  Eran Yahav, et al.  Code completion with statistical language models, 2014, PLDI.

[3]  Mira Mezini, et al.  Learning from examples to improve code completion systems, 2009, ESEC/SIGSOFT FSE.

[4]  Miltiadis Allamanis, et al.  The adverse effects of code duplication in machine learning models of code, 2018, Onward!.

[5]  Rafael-Michael Karampatsis, et al.  Maybe Deep Neural Networks are the Best Choice for Modeling Source Code, 2019, ArXiv.

[6]  Martin T. Vechev, et al.  PHOG: Probabilistic Model for Code, 2016, ICML.

[7]  Premkumar T. Devanbu, et al.  Are deep neural networks the best choice for modeling source code?, 2017, ESEC/SIGSOFT FSE.

[8]  Harald C. Gall, et al.  When Code Completion Fails: A Case Study on Real-World Completions, 2019, 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE).

[9]  Gail C. Murphy, et al.  How Are Java Software Developers Using the Eclipse IDE?, 2006.

[10]  Anh Tuan Nguyen, et al.  A statistical semantic language model for source code, 2013, ESEC/FSE 2013.

[11]  Stanley F. Chen, et al.  An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[12]  Yue Wang, et al.  Code Completion with Neural Attention and Pointer Networks, 2017, IJCAI.

[13]  Martin T. Vechev, et al.  Probabilistic model for code with decision trees, 2016, OOPSLA.

[14]  Kenneth Heafield, et al.  KenLM: Faster and Smaller Language Model Queries, 2011, WMT@EMNLP.

[15]  Mik Kersten, et al.  How are Java software developers using the Eclipse IDE?, 2006, IEEE Software.

[16]  Rico Sennrich, et al.  Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.

[17]  Gail E. Kaiser, et al.  Sequence Model Design for Code Completion in the Modern IDE, 2020, ArXiv.

[18]  Oleksandr Polozov, et al.  Generative Code Modeling with Graphs, 2018, ICLR.

[19]  Satish Chandra, et al.  Code Prediction by Feeding Trees to Transformers, 2020, 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE).

[20]  Omer Levy, et al.  Structural Language Models of Code, 2019, ICML.

[21]  Philipp Koehn, et al.  Scalable Modified Kneser-Ney Language Model Estimation, 2013, ACL.