Generating Commit Messages from Diffs using Pointer-Generator Network

The commit messages in source code repositories are valuable but not easy to be generated manually in time for tracking issues, reporting bugs, and understanding codes. Recently published works indicated that the deep neural machine translation approaches have drawn considerable attentions on automatic generation of commit messages. However, they could not deal with out-of-vocabulary (OOV) words, which are essential context-specific identifiers such as class names and method names in code diffs. In this paper, we propose PtrGNCMsg, a novel approach which is based on an improved sequence-to-sequence model with the pointer-generator network to translate code diffs into commit messages. By searching the smallest identifier set with the highest probability, PtrGNCMsg outperforms recent approaches based on neural machine translation, and first enables the prediction of OOV words. The experimental results based on the corpus of diffs and manual commit messages from the top 2,000 Java projects in GitHub show that PtrGNCMsg outperforms the state-of-the-art approach with improved BLEU by 1.02, ROUGE-1 by 4.00 and ROUGE-L by 3.78, respectively.

[1]  Miryung Kim,et al.  A graph-based approach to API usage adaptation , 2010, OOPSLA.

[2]  Yuen Ren Chao,et al.  Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology , 1950 .

[3]  Xiaodong Gu,et al.  Deep API learning , 2016, SIGSOFT FSE.

[4]  Collin McMillan,et al.  Towards Automatic Generation of Short Summaries of Commits , 2017, 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC).

[5]  Mario Linares Vásquez,et al.  ChangeScribe: A Tool for Automatically Generating Commit Messages , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[6]  Trevor Darrell,et al.  Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  Lukás Burget,et al.  Recurrent neural network based language model , 2010, INTERSPEECH.

[8]  George Kingsley Zipf,et al.  Human behavior and the principle of least effort , 1949 .

[9]  Stas Negara,et al.  Mining fine-grained code changes to detect unknown change patterns , 2014, ICSE.

[10]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[11]  Christopher D. Manning,et al.  Get To The Point: Summarization with Pointer-Generator Networks , 2017, ACL.

[12]  David Lo,et al.  Deep Code Comment Generation , 2018, 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC).

[13]  Coen De Roover,et al.  Mining Change Histories for Unknown Systematic Edits , 2017, 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR).

[14]  Ye Yuan,et al.  Review Networks for Caption Generation , 2016, NIPS.

[15]  Yue Wang,et al.  Code Completion with Neural Attention and Pointer Networks , 2017, IJCAI.

[16]  Kai Chen,et al.  Mining succinct and high-coverage API usage patterns from source code , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[17]  Premkumar T. Devanbu,et al.  Are deep neural networks the best choice for modeling source code? , 2017, ESEC/SIGSOFT FSE.

[18]  Walid Maalej,et al.  Can development work describe itself? , 2010, 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010).

[19]  Alvin Cheung,et al.  Summarizing Source Code using a Neural Attention Model , 2016, ACL.

[20]  Collin McMillan,et al.  Automatically generating commit messages from diffs using neural machine translation , 2017, 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[21]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[22]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[23]  Michael Philippsen,et al.  Automatic Clustering of Code Changes , 2016, 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR).

[24]  Navdeep Jaitly,et al.  Pointer Networks , 2015, NIPS.

[25]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[26]  Westley Weimer,et al.  Automatically documenting program changes , 2010, ASE.

[27]  Martin P. Robillard,et al.  Temporal analysis of API usage concepts , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[28]  Nancy Chinchor,et al.  MUC-4 evaluation metrics , 1992, MUC.

[29]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[30]  Yutaka Matsuo,et al.  A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes , 2017, ACL.

[31]  Zhendong Su,et al.  On the naturalness of software , 2012, ICSE 2012.

[32]  Jonathan Sillito,et al.  Why are software projects moving from centralized to decentralized version control systems? , 2009, 2009 ICSE Workshop on Cooperative and Human Aspects on Software Engineering.

[33]  Dan Klein,et al.  Abstract Syntax Networks for Code Generation and Semantic Parsing , 2017, ACL.

[34]  Mohammed J. Zaki,et al.  CHARM: An Efficient Algorithm for Closed Itemset Mining , 2002, SDM.

[35]  Rainer Koschke,et al.  How do professional developers comprehend software? , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[36]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.