Code Structure Guided Transformer for Source Code Summarization

Source code summarization aims to generate concise descriptions of a program's functionality. While Transformer-based approaches achieve promising performance, they do not explicitly incorporate code structure information, which is important for capturing code semantics. Moreover, without explicit constraints, the multi-head attention in a Transformer may suffer from attention collapse, yielding poor code representations for summarization. Effectively integrating code structure information into the Transformer remains under-explored in this task. In this paper, we propose a novel approach named SG-Trans to incorporate code structural properties into the Transformer. Specifically, to capture the hierarchical characteristics of code, we inject local symbolic information (e.g., code tokens) and global syntactic structure (e.g., data flow) into the self-attention module as an inductive bias. Extensive evaluation shows that SG-Trans outperforms state-of-the-art approaches.
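One way to read "injecting structure into self-attention as an inductive bias" is as a structure-derived attention mask: certain heads are restricted to token pairs that are structurally related (e.g., linked by a data-flow edge). The sketch below is an illustrative, simplified interpretation, not the paper's exact mechanism; the function name `structure_masked_attention` and the binary `struct_mask` matrix are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def structure_masked_attention(Q, K, V, struct_mask):
    """Scaled dot-product attention restricted by a code-structure mask.

    struct_mask[i, j] = 1 if tokens i and j are structurally related
    (e.g., in the same statement or connected by a data-flow edge), else 0.
    Unrelated pairs get a large negative score, so the head attends
    only along the structural relation -- a hard inductive bias.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(struct_mask.astype(bool), scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ V
```

In a multi-head setting, different heads could receive different masks (local token adjacency vs. global data flow), so that local and global structure are captured by separate heads while unconstrained heads remain free.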
