Exploiting Method Names to Improve Code Summarization: A Deliberation Multi-Task Learning Approach

Code summaries are brief natural language descriptions of source code snippets. The main purpose of code summarization is to help developers understand code and to reduce documentation workload. In this paper, we design a novel multi-task learning (MTL) approach for code summarization by mining the relationship between method code summaries and method names. More specifically, since a method's name can be considered a shorter version of its code summary, we first introduce two auxiliary training objectives for code summarization: method name generation and method name informativeness prediction. A novel two-pass deliberation mechanism is then incorporated into our MTL architecture to generate more consistent intermediate states to feed into the summary decoder, especially when informative method names do not exist. To evaluate our deliberation MTL approach, we carried out large-scale experiments on two existing datasets for Java and Python. The experimental results show that our technique can be readily applied to many state-of-the-art neural models for code summarization and improves their performance. Moreover, our approach is particularly effective when generating summaries for methods with non-informative names.
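
To make the described architecture more concrete, the following is a minimal, hypothetical sketch of a deliberation-style MTL model in PyTorch. It is not the paper's implementation: the shared code encoder, the first-pass name decoder, the informativeness classifier, the second-pass summary decoder, and all module names, sizes, and the shared-vocabulary assumption are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of a deliberation-style
# multi-task model: a shared code encoder, a first-pass decoder that drafts a
# method name, an informativeness classifier over the draft, and a second-pass
# summary decoder that consumes both the encoded code and the first-pass states.
import torch
import torch.nn as nn


class DeliberationMTL(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)                         # shared vocabulary (assumption)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)               # shared code encoder
        self.name_decoder = nn.GRU(hidden, hidden, batch_first=True)          # first pass: method name
        self.summary_decoder = nn.GRU(2 * hidden, hidden, batch_first=True)   # second pass: summary
        self.name_out = nn.Linear(hidden, vocab_size)
        self.summary_out = nn.Linear(hidden, vocab_size)
        self.informativeness = nn.Linear(hidden, 1)                           # auxiliary binary prediction

    def forward(self, code_tokens, name_tokens, summary_tokens):
        # Encode the method body tokens.
        enc_states, enc_last = self.encoder(self.embed(code_tokens))

        # First pass: draft the method name from the encoded code.
        name_states, _ = self.name_decoder(self.embed(name_tokens), enc_last)
        name_logits = self.name_out(name_states)

        # Auxiliary task: predict whether the method name is informative.
        info_logit = self.informativeness(name_states.mean(dim=1))

        # Second pass (deliberation): the summary decoder sees the code context
        # concatenated with its inputs and is initialized from the first-pass
        # states, so draft information flows into the summary.
        ctx = enc_states.mean(dim=1, keepdim=True).expand(-1, summary_tokens.size(1), -1)
        dec_in = torch.cat([self.embed(summary_tokens), ctx], dim=-1)
        h0 = name_states[:, -1:].transpose(0, 1).contiguous()
        summary_states, _ = self.summary_decoder(dec_in, h0)
        summary_logits = self.summary_out(summary_states)

        return name_logits, info_logit, summary_logits
```

In a sketch like this, a training step would combine three losses, such as cross-entropy for name generation, binary cross-entropy for informativeness prediction, and cross-entropy for summary generation, with tunable weights on the two auxiliary objectives.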
