Mistral 7B

We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length at reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B Instruct, that surpasses Llama 2 13B Chat on both human and automated benchmarks. Our models are released under the Apache 2.0 license.
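To make the two attention mechanisms concrete, below is a minimal, illustrative sketch of causal sliding-window attention combined with GQA-style key/value head sharing. This is not Mistral's released implementation (which relies on optimized attention kernels and a rolling-buffer KV cache); the function names and toy tensor shapes are assumptions chosen for clarity. In the actual model, 32 query heads share 8 K/V heads and the attention window is 4096 tokens.

```python
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to position j iff i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)  # (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # (1, seq_len)
    return (j <= i) & (j > i - window)

def gqa_sliding_window_attention(q, k, v, n_kv_heads: int, window: int):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).

    GQA: each group of n_heads // n_kv_heads query heads shares one K/V head.
    SWA: attention is restricted to a causal window of the last `window` tokens.
    """
    n_heads, seq_len, d = q.shape
    group = n_heads // n_kv_heads
    # Repeat each K/V head so every query head in a group sees the same K/V.
    k = k.repeat_interleave(group, dim=0)
    v = v.repeat_interleave(group, dim=0)
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (n_heads, seq, seq)
    mask = sliding_window_causal_mask(seq_len, window)   # broadcast over heads
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                 # (n_heads, seq, d)

# Toy usage with made-up sizes (illustrative only).
q = torch.randn(8, 16, 32)   # 8 query heads
k = torch.randn(2, 16, 32)   # 2 shared K/V heads
v = torch.randn(2, 16, 32)
out = gqa_sliding_window_attention(q, k, v, n_kv_heads=2, window=4)
print(out.shape)  # torch.Size([8, 16, 32])
```

The window mask caps each token's per-layer attention cost at the window size rather than the full sequence length, while stacking layers lets information propagate beyond the window, so the effective receptive field grows with depth.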
