Multi-lingual Evaluation of Code Generation Models

We present three new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover more than ten programming languages and are generated with a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in each target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion and uncover several findings: language models generalize to out-of-domain languages, multi-lingual models hold advantages over mono-lingual ones, few-shot prompting can teach a model a new language, and zero-shot translation abilities appear even in mono-lingual models. Furthermore, we use our code generation model to perform large-scale bootstrapping and obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks. Overall, our benchmarks represent a significant step towards a deeper understanding of language models' code generation abilities. We publicly release our code and datasets at https://github.com/amazon-research/mxeval.
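
To make the execution-based evaluation concrete, the sketch below shows one way a transpiled prompt, a model completion, and the converted test cases could be executed for a target language and scored with the standard unbiased pass@k estimator. This is a minimal sketch, not the released mxeval API: the helper names (`run_candidate`, `estimate_pass_at_k`) and the `command`/`suffix` arguments are illustrative assumptions.

```python
import subprocess
import tempfile
from math import comb


def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n samples were drawn and c of them passed all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def run_candidate(prompt: str, completion: str, test_code: str,
                  command: list, suffix: str, timeout: int = 10) -> bool:
    """Concatenate the transpiled prompt, the model completion, and the
    converted test cases, then execute the file with the target-language
    toolchain (e.g. command=["python3"], suffix=".py" or
    command=["node"], suffix=".js"). Returns True if all tests pass."""
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as f:
        f.write(prompt + completion + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(command + [path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```

For example, drawing n = 20 completions per problem and counting how many pass via `run_candidate` yields the per-problem c, from which `estimate_pass_at_k(20, c, k)` gives pass@1, pass@5, and so on.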
