Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors

Automated source code summarization is a popular software engineering research topic in which machine translation models are employed to “translate” code snippets into relevant natural language descriptions. Most evaluations of such models are conducted using automatic reference-based metrics. However, given the relatively large semantic gap between programming languages and natural language, we argue that this line of research would benefit from a qualitative investigation into the various error modes of current state-of-the-art models. Therefore, in this work, we perform both a quantitative and a qualitative comparison of three recently proposed source code summarization models. In our quantitative evaluation, we compare the models using the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics. In our qualitative evaluation, we manually open-code the most common errors the models commit relative to ground truth captions. Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors, grounded in an error taxonomy that can be used to drive future research efforts.
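To make concrete what a reference-based metric measures, the following is a minimal sketch of a sentence-level ROUGE-L F-score, which rewards the longest common subsequence (LCS) of tokens shared by a predicted summary and its ground truth caption. It is an illustration only, not the evaluation code used in this study: the function names `lcs_length` and `rouge_l_f1`, the lowercasing, the whitespace tokenization, and the equal weighting of precision and recall are all assumptions made for this example.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Sentence-level ROUGE-L F1 (assumed: lowercase, whitespace tokens, beta = 1)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of predicted tokens that lie on the LCS
    recall = lcs / len(ref)      # share of reference tokens that lie on the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = "returns the name of the file"
    ground_truth = "returns the file name"
    print(round(rouge_l_f1(predicted, ground_truth), 3))
```

In this hypothetical pair the prediction shares a three-token subsequence with the reference, so it scores reasonably well even though the wording differs. A prediction can also score well while misdescribing the code, which is the kind of gap between metric scores and actual errors that a qualitative analysis is meant to expose.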
