HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks

Many data scientists use Jupyter notebooks to experiment with code, visualize results, and document rationales or interpretations. The code documentation generation (CDG) task in notebooks is related to, but different from, the code summarization task in software engineering, because a single piece of documentation (a markdown cell) may contain text (an informative summary or an indicative rationale) that covers multiple code cells. Our work aims to solve the CDG task by encoding the multiple code cells as separate AST graph structures, for which we propose a hierarchical attention-based ConvGNN component to augment the Seq2Seq network. We build a dataset from publicly available Kaggle notebooks and evaluate our model (HAConvGNN) against baseline models (e.g., Code2Seq and Graph2Seq).
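To make the idea of encoding multiple code cells as separate AST graph structures concrete, the following is a minimal sketch that assumes the code cells are Python and uses the standard-library `ast` module. It is not the authors' preprocessing pipeline: the helper names `cell_to_graph` and `cells_to_graphs`, the node labels, and the edge representation are illustrative assumptions, and the hierarchical attention and ConvGNN layers that would consume these graphs are not shown.

```python
import ast


def cell_to_graph(source):
    """Parse one code cell into (node_labels, edges) over its AST.

    Hypothetical helper: nodes are labeled by AST node type and edges
    are (parent_index, child_index) pairs.
    """
    nodes, edges = [], []

    def visit(node, parent):
        idx = len(nodes)
        nodes.append(type(node).__name__)
        if parent is not None:
            edges.append((parent, idx))
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(ast.parse(source), None)
    return nodes, edges


def cells_to_graphs(cells):
    """Keep one graph per code cell instead of merging them into one AST."""
    return [cell_to_graph(cell) for cell in cells]


# Three code cells that a single markdown cell might document together.
cells = [
    "import pandas as pd",
    "df = pd.read_csv('train.csv')",
    "df.head()",
]
graphs = cells_to_graphs(cells)
for nodes, edges in graphs:
    print(len(nodes), "nodes,", len(edges), "edges")
```

Keeping one graph per cell preserves the cell boundaries, which is what lets a model attend first within each cell's graph and then across cells.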
