Evaluating Large Language Models Trained on Code

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

Equal contribution. OpenAI, San Francisco, California, USA. Anthropic AI, San Francisco, California, USA; work performed while at OpenAI. Zipline, South San Francisco, California, USA; work performed while at OpenAI. Correspondence to: Mark Chen <mark@openai.com>, Jerry Tworek <jt@openai.com>, Heewoo Jun <heewoo@openai.com>, Qiming Yuan <qiming@openai.com>.
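The repeated-sampling result (70.2% of problems solved with 100 samples per problem) is scored by running each generated sample against the problem's unit tests and then estimating pass@k over the samples. Below is a minimal sketch of the unbiased pass@k estimator described in the full paper; the function name `pass_at_k` and the toy numbers in the usage lines are illustrative, not taken from this page.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for a single problem.

    n: total number of samples generated for the problem
    c: number of samples that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Too few incorrect samples: every size-k subset contains a correct one.
        return 1.0
    # 1 - C(n - c, k) / C(n, k), computed in a numerically stable form.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative usage: 100 samples drawn for one problem, 7 of which pass the tests.
print(pass_at_k(n=100, c=7, k=1))    # 0.07: chance a single sample solves the problem
print(pass_at_k(n=100, c=7, k=100))  # 1.0: with all 100 samples, the problem counts as solved
```

Averaging this quantity over all problems in the benchmark gives the reported pass@k numbers.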

Wojciech Zaremba | Lukasz Kaiser | Peter Welinder | Ilya Sutskever | Jan Leike | Alec Radford | Heewoo Jun | Mohammad Bavarian | Dario Amodei | Miles Brundage | Bob McGrew | Scott Gray | Nick Ryder | Sam McCandlish | Qiming Yuan | Heidy Khlaaf | Raul Puri | Girish Sastry | Matthias Plappert | Ariel Herbert-Voss | Brooke Chan | Alex Nichol | Yura Burda | William H. Guss | Evan Morikawa | Philippe Tillet | Vedant Misra | Igor Babuschkin | Matthew Knight | Clemens Winter | Pamela Mishkin | Gretchen Krueger | Nicholas Joseph | Jared Kaplan | Alethea Power | Harri Edwards | Mikhail Pavlov | Mark Chen | Alex Ray | Jerry Tworek | Suchir Balaji | Shantanu Jain | Henrique Ponde | Greg Brockman | Michael Petrov | Felipe Such | Dave Cummings | Fotios Chantzis | Elizabeth Barnes | Andrew Carr | Josh Achiam | Mira Murati | Katie Mayer
