CodeExp: Explanatory Code Document Generation

Developing models that can automatically generate detailed code explanations can greatly benefit software maintenance and programming education. However, existing code-to-text generation models often produce only high-level summaries of code that do not capture implementation-level choices essential for these scenarios. To fill this gap, we propose the code explanation generation task. We first conducted a human study to identify the criteria for high-quality explanatory docstrings for code. Based on these criteria, we collected and refined a large-scale code docstring corpus and formulated automatic evaluation metrics that best match human assessments. Finally, we present a multi-stage fine-tuning strategy and baseline models for the task. Our experiments show that (1) our refined training dataset enables models to achieve better performance on the explanation generation task than unrefined data 15x its size, and (2) fine-tuned models can generate well-structured long docstrings comparable to human-written ones. We envision that our training dataset, human-evaluation protocol, recommended metrics, and fine-tuning strategy will boost future code explanation research. The code and annotated data are available at https://github.com/subercui/CodeExp.
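To make the evaluation setup concrete, below is a minimal, illustrative sketch (not the paper's exact pipeline) of scoring a model-generated explanatory docstring against a human-written reference with standard text-generation metrics; the specific docstring pair, metric choices, and the `sacrebleu` and `bert-score` packages are assumptions for the example.

```python
# Illustrative sketch: compare a generated docstring to a reference docstring
# with two common automatic metrics. Requires: pip install sacrebleu bert-score
import sacrebleu
from bert_score import score as bert_score

# Hypothetical example pair, not drawn from the CodeExp dataset.
reference = (
    "Return the factorial of n.\n\n"
    "Iteratively multiplies the integers 1..n and raises ValueError for n < 0."
)
candidate = (
    "Compute the factorial of n by iterating from 1 to n and multiplying.\n"
    "Raises ValueError if n is negative."
)

# sacrebleu expects a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu([candidate], [[reference]])
print(f"BLEU: {bleu.score:.2f}")

# BERTScore compares contextual embeddings rather than surface n-grams,
# which tends to track semantic agreement more closely on long docstrings.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```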
