OctoPack: Instruction Tuning Code Large Language Models

Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with human instructions. We compile CommitPack: 4 terabytes of Git commits across 350 programming languages. We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B-parameter StarCoder model, and achieve state-of-the-art performance on the HumanEval Python benchmark (46.2% pass@1) among models not trained on OpenAI outputs. We further introduce HumanEvalPack, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among all permissive models, demonstrating CommitPack's benefits in generalizing to a wider set of languages and natural coding tasks. Code, models, and data are freely available at https://github.com/bigcode-project/octopack.
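
To make the commit-as-instruction idea concrete, the sketch below shows one plausible way a single Git commit could be rendered as an instruction-tuning sample: the commit message acts as the instruction, the pre-commit file contents as the input, and the post-commit contents as the target output. The field names and prompt template here are illustrative assumptions, not the paper's exact data format or pipeline.

```python
# Minimal sketch (illustrative, not the authors' exact pipeline) of turning
# a Git commit into an instruction-tuning sample. Field names and the
# "Question/Answer" template are assumptions for demonstration only.

from dataclasses import dataclass


@dataclass
class CommitSample:
    instruction: str   # commit message, e.g. "Handle empty input list"
    old_contents: str  # file contents before the commit (model input)
    new_contents: str  # file contents after the commit (training target)


def to_prompt(sample: CommitSample) -> str:
    """Render one commit as an instruction/response pair for finetuning."""
    return (
        f"Question: {sample.instruction}\n\n"
        f"{sample.old_contents}\n\n"
        f"Answer:\n{sample.new_contents}"
    )


if __name__ == "__main__":
    demo = CommitSample(
        instruction="Handle empty input list",
        old_contents="def mean(xs):\n    return sum(xs) / len(xs)\n",
        new_contents=(
            "def mean(xs):\n"
            "    if not xs:\n"
            "        return 0.0\n"
            "    return sum(xs) / len(xs)\n"
        ),
    )
    print(to_prompt(demo))
```

In this hypothetical format, the loss would typically be computed only on the tokens after "Answer:", so the model learns to produce the edited code given the instruction and the original file.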
