Code Llama: Open Foundation Models for Code

We release Code Llama, a family of large language models for code based on Llama 2, providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction-following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct), each with 7B, 13B, and 34B parameters. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. The 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.
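
To make the infilling capability mentioned above concrete, the sketch below shows one way to prompt a Code Llama model to fill a gap using both the preceding and following code. This is a minimal sketch assuming the Hugging Face transformers integration; the checkpoint name codellama/CodeLlama-7b-hf and the <FILL_ME> placeholder convention come from that integration, not from the paper itself.

```python
# Minimal infilling sketch (assumes the Hugging Face transformers integration;
# the model id and the <FILL_ME> placeholder are conventions of that
# integration, not defined in the paper).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The 7B and 13B base and Instruct variants are trained for infilling:
# the model completes the span marked by <FILL_ME>, conditioning on both
# the prefix and the suffix. Per the Hugging Face integration, the
# tokenizer expands <FILL_ME> into the model's prefix/suffix format.
prompt = '''def remove_non_ascii(s: str) -> str:
    """<FILL_ME>
    return result
'''

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# Keep only the newly generated tokens and splice them into the gap.
filling = tokenizer.decode(
    outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(prompt.replace("<FILL_ME>", filling))
```

In this setup the surrounding code acts as a bidirectional context, which is what distinguishes infilling from ordinary left-to-right completion; the Instruct and Python 34B variants described in the paper do not support this mode.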
