Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming

Code-recommendation systems such as GitHub Copilot and Amazon CodeWhisperer can improve programmer productivity by suggesting and auto-completing code. To fully realize this potential, however, we must understand how programmers interact with these systems and identify ways to improve that interaction. We therefore studied GitHub Copilot, a code-recommendation system used by millions of programmers daily, and developed CUPS, a taxonomy of the common activities programmers engage in while interacting with it. In our study, 21 programmers completed coding tasks and retrospectively labeled their sessions with CUPS; the resulting labels revealed inefficiencies and time costs in how programmers interact with code-recommendation systems. These insights motivate new interface designs and evaluation metrics for AI-assisted programming.