Ensemble Models for Neural Source Code Summarization of Subroutines

A source code summary of a subroutine is a brief natural language description of that subroutine. Summaries underpin much of the documentation programmers consume, such as the method summaries in JavaDocs. Source code summarization is the task of writing these summaries. At present, most state-of-the-art approaches to code summarization are neural encoder-decoder architectures such as seq2seq and graph2seq. The encoder receives the source code as input, and the decoder predicts the natural language summary. While these models tend to be similar in structure, evidence is emerging that different models make different contributions to prediction quality: their performance differences are orthogonal and complementary rather than uniform over the entire dataset. In this paper, we explore the orthogonal nature of different neural code summarization approaches and propose ensemble models that exploit this orthogonality for better overall performance. We demonstrate that a simple ensemble strategy boosts performance by up to 14.8%, and we provide an explanation for this boost. The takeaway from this work is that a relatively small change to the inference procedure in most neural code summarization techniques leads to outsized improvements in prediction quality.
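To make the idea concrete, the sketch below shows one simple form of ensemble inference, assuming several independently trained encoder-decoder summarization models: at each decoding step the models' softmax distributions over the next summary token are averaged, and the most likely token is emitted. The models and their next_token_probs method are hypothetical stand-ins for whatever framework the models are built in; this is an illustration of the general strategy, not the paper's exact procedure.

    import numpy as np

    def ensemble_greedy_decode(models, code_tokens, start_id, end_id, max_len=30):
        # Greedy decoding that averages the per-step softmax output of every model.
        summary = [start_id]
        for _ in range(max_len):
            # Each (hypothetical) model returns a distribution over the output vocabulary
            # given the source code tokens and the summary generated so far.
            dists = [m.next_token_probs(code_tokens, summary) for m in models]
            mean_dist = np.mean(dists, axis=0)  # simple ensemble: mean of the distributions
            next_id = int(np.argmax(mean_dist))
            if next_id == end_id:
                break
            summary.append(next_id)
        return summary[1:]  # drop the start-of-sequence token

Because the combination happens only at inference time, the individually trained models need no retraining; this reflects the point that a small change to the inference procedure can exploit the complementary strengths of different models.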
