ATOM: Commit Message Generation Based on Abstract Syntax Tree and Hybrid Ranking

Commit messages record code changes (e.g., feature modifications and bug repairs) in natural language, and are useful for program comprehension. Because software is updated frequently and writing takes time, developers are often unmotivated to write commit messages for their code changes. Automating the message writing process is therefore necessary. Previous studies on commit message generation have benefited from generation models or retrieval models, but the structure of the changed code, i.e., its abstract syntax tree (AST), which can be important for capturing code semantics, has not been explicitly involved. Moreover, although generation models have the advantage of synthesizing commit messages for new code changes, they struggle to bridge the semantic gap between code and natural language, a gap that retrieval models could mitigate. In this paper, we propose a novel commit message generation model, named ATOM, which explicitly incorporates the AST for representing code changes and integrates both retrieved and generated messages through hybrid ranking. Specifically, for a given code change, the hybrid ranking module prioritizes the most accurate message among the retrieved and generated candidates. We evaluate ATOM on our dataset crawled from 56 popular Java repositories. Experimental results demonstrate that ATOM improves on the state-of-the-art models by 30.72 percent in terms of BLEU-4 (an accuracy measure that is widely used to evaluate text generation systems). Qualitative analysis also demonstrates the effectiveness of ATOM in generating accurate commit messages.
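To make the BLEU-4 measure concrete, the following is a minimal sketch of an unsmoothed, sentence-level BLEU-4 score against a single reference message. It is an illustration only: the function names `ngrams` and `bleu4`, whitespace tokenization, and the absence of smoothing are assumptions made here, and the evaluation reported above may use a different (for example, corpus-level or smoothed) variant.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 against one reference (illustrative)."""
    # Assumption: plain whitespace tokenization.
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: one empty precision zeroes the score
        log_precision_sum += math.log(matched / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the four modified n-gram precisions.
    return bp * math.exp(log_precision_sum / 4)


if __name__ == "__main__":
    reference = "fix null pointer in parser"
    for message in ("fix null pointer in parser",
                    "fix null pointer in the parser",
                    "update readme"):
        print(f"{message!r}: {bleu4(message, reference):.4f}")
```

Under this definition an exact match scores 1, a message that shares no words with the reference scores 0, and a near match falls in between, which is why the metric rewards messages that reuse the reference's wording in the same order.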
