Multi-lingual Evaluation of Code Generation Models

We present three new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover more than ten programming languages and are generated with a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in each target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion and uncover several findings: language models generalize to out-of-domain languages, multi-lingual models hold advantages over mono-lingual ones, few-shot prompting can teach a model a new language, and zero-shot translation abilities appear even in mono-lingual models. Furthermore, we use our code generation model to perform large-scale bootstrapping and obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks. Overall, our benchmarks represent a significant step towards a deeper understanding of language models' code generation abilities. We publicly release our code and datasets at https://github.com/amazon-research/mxeval.
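
To make the execution-based evaluation concrete, the sketch below shows one way a transpiled prompt, a model completion, and the converted test cases could be executed for a target language and scored with the standard unbiased pass@k estimator. This is a minimal sketch, not the released mxeval API: the helper names (`run_candidate`, `estimate_pass_at_k`) and the `command`/`suffix` arguments are illustrative assumptions.

```python
import subprocess
import tempfile
from math import comb


def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n samples were drawn and c of them passed all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def run_candidate(prompt: str, completion: str, test_code: str,
                  command: list, suffix: str, timeout: int = 10) -> bool:
    """Concatenate the transpiled prompt, the model completion, and the
    converted test cases, then execute the file with the target-language
    toolchain (e.g. command=["python3"], suffix=".py" or
    command=["node"], suffix=".js"). Returns True if all tests pass."""
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as f:
        f.write(prompt + completion + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(command + [path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```

For example, drawing n = 20 completions per problem and counting how many pass via `run_candidate` yields the per-problem c, from which `estimate_pass_at_k(20, c, k)` gives pass@1, pass@5, and so on.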
