CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Evaluation metrics play a vital role in the growth of a research area because they define the standard for distinguishing good models from bad ones. In code synthesis, the commonly used evaluation metrics are BLEU and perfect accuracy, but neither is well suited to evaluating code: BLEU was originally designed for natural language and neglects important syntactic and semantic features of code, while perfect accuracy is too strict and therefore underestimates different outputs that share the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in n-gram matching and further injects code syntax via abstract syntax trees (AST) and code semantics via data flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by programmers on three code synthesis tasks: text-to-code, code translation, and code refinement. Experimental results show that CodeBLEU achieves a better correlation with programmer-assigned scores than BLEU and accuracy.
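
The idea of pairing an n-gram match with a syntax match can be illustrated with a minimal sketch. This is not the CodeBLEU implementation: the function names, the whitespace tokenizer, the unigram-only precision, the subtree signature scheme, and the equal weights are all assumptions made for illustration, and the data-flow component is omitted entirely. The sketch only shows why a structural comparison can reward a candidate whose identifiers differ from the reference.

```python
import ast
from collections import Counter


def ngram_precision(candidate_tokens, reference_tokens, n):
    """Clipped n-gram precision for a single n (the core idea behind BLEU)."""
    cand = Counter(
        tuple(candidate_tokens[i:i + n])
        for i in range(len(candidate_tokens) - n + 1)
    )
    ref = Counter(
        tuple(reference_tokens[i:i + n])
        for i in range(len(reference_tokens) - n + 1)
    )
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def subtree_signatures(source):
    """Count a structural signature for every AST subtree.

    Only node type names are used, so identifier names and literal
    values do not affect the signature (an illustrative choice).
    """
    signatures = Counter()

    def visit(node):
        children = [visit(child) for child in ast.iter_child_nodes(node)]
        signature = type(node).__name__ + "(" + ",".join(children) + ")"
        signatures[signature] += 1
        return signature

    visit(ast.parse(source))
    return signatures


def syntax_match(candidate_src, reference_src):
    """Fraction of reference subtrees that also occur in the candidate."""
    cand = subtree_signatures(candidate_src)
    ref = subtree_signatures(reference_src)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand[sig]) for sig, count in ref.items())
    return matched / total


def combined_score(candidate_src, reference_src, w_ngram=0.5, w_syntax=0.5):
    """Weighted sum of the two components; the weights are hypothetical."""
    ngram = ngram_precision(candidate_src.split(), reference_src.split(), 1)
    return w_ngram * ngram + w_syntax * syntax_match(candidate_src, reference_src)


reference = "def add(a, b):\n    return a + b"
candidate = "def add(x, y):\n    return x + y"

print(ngram_precision(candidate.split(), reference.split(), 1))
print(syntax_match(candidate, reference))
print(combined_score(candidate, reference))
```

In this example the candidate differs from the reference only in its parameter names, so the token-level precision is low while the syntax match is perfect, and the combined score falls between the two. An exact-match accuracy metric would score the same pair as zero.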
