In-IDE Code Generation from Natural Language: Promise and Challenges

A large part of software development involves conceptualizing or communicating the underlying procedures and logic that need to be expressed in programs. One major difficulty of programming is turning concepts into code, especially when dealing with the APIs of unfamiliar libraries. Recently, there has been a proliferation of machine learning methods for code generation and retrieval from natural language queries, but these have been evaluated primarily on retrieval accuracy or on the overlap of generated code with developer-written code, and the actual effect of these methods on the developer workflow is surprisingly unattested. In this article, we perform the first comprehensive investigation of the promise and challenges of using such technology inside the PyCharm IDE, asking, “At the current state of technology, does it improve developer productivity or accuracy, how does it affect the developer experience, and what are the remaining gaps and challenges?” To facilitate the study, we first develop a plugin for the PyCharm IDE that implements a hybrid of code generation and code retrieval functionality, and we orchestrate virtual environments to enable collection of many user events (e.g., web browsing, keystrokes, fine-grained code edits). We ask developers with various backgrounds to complete 14 Python programming tasks of 7 varieties, ranging from basic file manipulation to machine learning and data visualization, with or without the help of the plugin. While qualitative surveys of developer experience are largely positive, quantitative results with regard to increased productivity, code quality, or program correctness are inconclusive. Further analysis identifies several pain points whose resolution could improve the effectiveness of future machine learning-based code generation/retrieval developer assistants, and demonstrates when developers prefer code generation over code retrieval and vice versa.
We release all data and software to pave the way for future empirical studies on this topic, as well as for the development of better code generation models.
