How Data Scientists Improve Generated Code Documentation in Jupyter Notebooks

Generative AI models can produce high-fidelity outputs, sometimes indistinguishable from what a human could produce. However, some domains possess an objective bar of quality, and the probabilistic nature of generative models means their output may contain imperfections or flaws. In software engineering, for example, code produced by a generative model may not compile, or it may contain bugs or logical errors. Various models of human-AI interaction, such as mixed-initiative user interfaces, suggest that human effort ought to be applied to a generative model's outputs in order to improve their quality. We report results from a controlled experiment in which data scientists used multiple models, including a GNN-based generative model, to generate and subsequently edit documentation for data science code in Jupyter notebooks. In analyzing their edit patterns, we identified the various ways humans improved the generated documentation, and we speculate that such edit data could be used to train generative models to identify not only which parts of their output might require human attention, but also how those parts could be improved.
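
As a minimal illustration of the kind of edit data this abstract alludes to, the sketch below (the helper name and example strings are hypothetical, not drawn from the study) uses Python's standard difflib module to recover which spans of a generated summary a human rewrote. Each non-equal span is an (operation, before, after) triple: the spans a human felt needed attention, paired with how they improved them.

    import difflib

    def edit_spans(generated: str, edited: str):
        """Token-level diff between generated and human-edited documentation.

        Non-'equal' opcodes mark the spans a human revised; the paired
        replacement text shows how those spans were improved.
        """
        gen_toks, ed_toks = generated.split(), edited.split()
        matcher = difflib.SequenceMatcher(a=gen_toks, b=ed_toks)
        return [
            (op, " ".join(gen_toks[a0:a1]), " ".join(ed_toks[b0:b1]))
            for op, a0, a1, b0, b1 in matcher.get_opcodes()
            if op != "equal"
        ]

    # Hypothetical example: a participant sharpens a vague generated summary.
    print(edit_spans(
        "This cell plots the data.",
        "This cell plots monthly revenue as a bar chart.",
    ))
    # -> [('replace', 'the data.', 'monthly revenue as a bar chart.')]

Aggregated over many participants, such triples could supply both a localization signal (which spans get edited) and a revision signal (what they get edited to).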
