Neural Code Summarization: How Far Are We?

Source code summaries are important for the comprehension and maintenance of programs. However, there are plenty of programs with missing, outdated, or mismatched summaries. Recently, deep learning techniques have been exploited to automatically generate summaries for given code snippets. To understand how far we are from solving this problem, we conduct a systematic and in-depth analysis of five state-of-the-art neural source code summarization models on three widely used datasets. Our evaluation results suggest that: (1) The BLEU metric, which is widely used by existing work for evaluating the performance of summarization models, has many variants. Ignoring the differences among the BLEU variants could affect the validity of the claimed results. Furthermore, we discover an important, previously unknown bug in the BLEU calculation of a commonly used software package. (2) Code pre-processing choices can have a large impact on summarization performance and therefore should not be ignored. (3) Some important characteristics of datasets (corpus size, data splitting method, and duplication ratio) have a significant impact on model evaluation. Based on the experimental results, we give actionable guidelines on more systematic ways of evaluating code summarization and of choosing the best method in different scenarios. We also suggest possible future research directions. We believe that our results can be of great help to practitioners and researchers in this area.
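To make finding (1) concrete, the following is a minimal, illustrative sketch of how the choice of BLEU variant alone can change a reported score. It is not the implementation used in the paper or in any particular software package: the function names, the toy hypothesis/reference pairs, and the simplified add-one smoothing (applied to every n-gram order) are all assumptions made for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_stats(hyp, ref, max_n=4):
    """Per-order (clipped matches, total hypothesis n-grams) for one pair."""
    stats = []
    for n in range(1, max_n + 1):
        h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
        matched = sum(min(count, r[gram]) for gram, count in h.items())
        stats.append((matched, sum(h.values())))
    return stats


def bleu_from_stats(stats, hyp_len, ref_len, smooth=0.0):
    """Geometric mean of n-gram precisions times the brevity penalty.

    `smooth` is added to every numerator and denominator; this is a
    simplified add-one scheme chosen for illustration only.
    """
    if hyp_len == 0:
        return 0.0
    log_sum = 0.0
    for matched, total in stats:
        num, den = matched + smooth, total + smooth
        if num == 0 or den == 0:
            return 0.0  # an unsmoothed zero precision collapses the score
        log_sum += math.log(num / den)
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_sum / len(stats))


def corpus_bleu(pairs):
    """Variant A: pool n-gram statistics over the whole corpus first."""
    pooled = [(0, 0)] * 4
    hyp_len = ref_len = 0
    for hyp, ref in pairs:
        stats = clipped_stats(hyp, ref)
        pooled = [(m + a, t + b) for (m, t), (a, b) in zip(pooled, stats)]
        hyp_len += len(hyp)
        ref_len += len(ref)
    return bleu_from_stats(pooled, hyp_len, ref_len)


def mean_sentence_bleu(pairs, smooth=0.0):
    """Variants B and C: score each sentence, then average the scores."""
    scores = [
        bleu_from_stats(clipped_stats(hyp, ref), len(hyp), len(ref), smooth)
        for hyp, ref in pairs
    ]
    return sum(scores) / len(scores)


# Toy (hypothesis, reference) pairs, invented for this example.
pairs = [
    ("returns the sum of two numbers".split(),
     "returns the sum of two integers".split()),
    ("get name".split(), "gets the user name".split()),
]

score_corpus = corpus_bleu(pairs)
score_sentence_raw = mean_sentence_bleu(pairs, smooth=0.0)
score_sentence_smoothed = mean_sentence_bleu(pairs, smooth=1.0)

print(f"corpus-level BLEU:              {score_corpus:.4f}")
print(f"mean sentence BLEU (no smooth): {score_sentence_raw:.4f}")
print(f"mean sentence BLEU (add-one):   {score_sentence_smoothed:.4f}")
```

The same model outputs receive three different scores depending only on the aggregation and smoothing choices, which is why a reported BLEU value is hard to compare across papers unless the exact variant is stated.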
