Exploring Length Generalization in Large Language Models

The ability to extrapolate from short problem instances to longer ones is an important form of out-of-distribution generalization in reasoning tasks, and is crucial when learning from datasets in which longer problem instances are rare. Examples include theorem proving, solving quantitative mathematics problems, and reading/summarizing novels. In this paper, we run careful empirical studies exploring the length generalization capabilities of transformer-based language models. We first establish that naively finetuning transformers on length generalization tasks yields significant generalization deficiencies independent of model scale. We then show that combining pretrained large language models' in-context learning abilities with scratchpad prompting (asking the model to output solution steps before producing an answer) results in a dramatic improvement in length generalization. We run careful failure analyses on each of the learning modalities and identify common sources of mistakes, highlighting opportunities for equipping language models with the ability to generalize to longer problems.
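To make the scratchpad idea concrete: for a length-generalization task such as multi-digit addition, the model is prompted to write out intermediate steps (here, the column-by-column carry computation) before emitting its final answer. The sketch below builds one such prompt; the exact step format is an illustrative assumption, not the paper's prompt.

```python
def scratchpad_addition_prompt(a: int, b: int) -> str:
    """Build a scratchpad-style prompt for multi-digit addition.

    The scratchpad spells out the per-column computation (digit sums
    and carries) before the final answer, so a model trained or
    prompted in this format produces its reasoning step by step.
    The step format here is illustrative only.
    """
    lines = [f"Input: {a} + {b}", "Scratchpad:"]
    da, db = str(a)[::-1], str(b)[::-1]  # least-significant digit first
    carry = 0
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        carry_in = carry
        total = x + y + carry_in
        digit, carry = total % 10, total // 10
        lines.append(f"{x} + {y} + {carry_in} = {total} -> digit {digit}, carry {carry}")
    if carry:
        lines.append(f"leading carry {carry}")
    lines.append(f"Answer: {a + b}")
    return "\n".join(lines)

print(scratchpad_addition_prompt(57, 68))
```

Few-shot prompting with several such worked examples, followed by a new "Input:" line, is the in-context setup the abstract contrasts with direct answer-only prompting.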
