Extending Context Window of Large Language Models via Positional Interpolation

We present Position Interpolation (PI), which extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA to up to 32768 tokens with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on tasks that require long context, including passkey retrieval, language modeling, and long document summarization, for LLaMA models from 7B to 65B. Meanwhile, models extended by Position Interpolation preserve quality relatively well on tasks within their original context window. To achieve this, Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length, which can lead to catastrophically high attention scores that completely break the self-attention mechanism. Our theoretical study shows that the upper bound of interpolation is at least $\sim 600 \times$ smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain their original architecture and can reuse most pre-existing optimizations and infrastructure.

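Below is a minimal sketch of the core idea, not the authors' implementation: positions over the extended window are linearly down-scaled into the originally trained range before computing rotary (RoPE) angles. The function names (`rope_frequencies`, `interpolated_angles`) and the 2048-to-8192 window sizes are illustrative assumptions.

```python
import torch

def rope_frequencies(dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies for a head dimension `dim`.
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

def interpolated_angles(seq_len: int, dim: int,
                        original_ctx: int = 2048,
                        extended_ctx: int = 8192) -> torch.Tensor:
    # Position Interpolation: linearly down-scale position indices so that
    # positions in [0, extended_ctx) map back into the trained range
    # [0, original_ctx), instead of extrapolating beyond it.
    scale = original_ctx / extended_ctx          # e.g. 2048 / 8192 = 0.25
    positions = torch.arange(seq_len).float() * scale
    inv_freq = rope_frequencies(dim)
    # One rotation angle per (position, frequency) pair.
    return torch.outer(positions, inv_freq)

# The resulting angles feed the usual cos/sin rotation applied to query and
# key vectors; only the position indices change, so the model architecture
# and attention computation stay exactly as in the pretrained model.
angles = interpolated_angles(seq_len=8192, dim=128)
cos, sin = angles.cos(), angles.sin()
```

Because only the position indices are rescaled, the attention scores the model sees during fine-tuning stay within the range it was pretrained on, which is what keeps the interpolation bound small relative to extrapolation.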