What Algorithms can Transformers Learn? A Study in Length Generalization

Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of whether and when Transformer models can learn the true algorithm for solving a task. We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks. Here, we propose a unifying framework to understand when and how Transformers can exhibit strong length generalization on a given task. Specifically, we leverage RASP (Weiss et al., 2021) -- a programming language designed for the computational model of a Transformer -- and introduce the RASP-Generalization Conjecture: Transformers tend to length generalize on a task if the task can be solved by a short RASP program which works for all input lengths. This simple conjecture captures most known instances of length generalization on algorithmic tasks. Moreover, we use these insights to drastically improve generalization performance on traditionally hard tasks (such as parity and addition). On the theoretical side, we give a simple example where the "min-degree-interpolator" model of learning from Abbe et al. (2023) does not correctly predict Transformers' out-of-distribution behavior, but our conjecture does. Overall, our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
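To make the conjecture concrete, the sketch below implements two RASP-style primitives, select and aggregate, in plain Python and uses them to write a reverse program with no length-dependent constants. This is a minimal illustrative simplification, not the official RASP implementation of Weiss et al. (2021); the function names, the hard one-hot selection, and the simplified semantics are assumptions made for the example.

from typing import Callable, List, Sequence


def select(keys: Sequence, queries: Sequence,
           predicate: Callable) -> List[List[bool]]:
    # Attention pattern: entry [q][k] is True iff predicate(keys[k], queries[q]).
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(pattern: List[List[bool]], values: Sequence) -> List:
    # For each query position, gather the values at the selected key positions.
    # With a hard one-hot selection (as below), this simply copies one value.
    out = []
    for row in pattern:
        selected = [v for v, keep in zip(values, row) if keep]
        out.append(selected[0] if len(selected) == 1 else selected)
    return out


def reverse_program(tokens: Sequence) -> List:
    # RASP-style reverse: position i attends to position n-1-i and copies its token.
    # Note that no constant below is tied to a particular training length.
    n = len(tokens)
    indices = list(range(n))
    opposite = [n - 1 - i for i in indices]
    flip = select(indices, opposite, lambda k, q: k == q)
    return aggregate(flip, tokens)


if __name__ == "__main__":
    print(reverse_program(list("abcde")))           # ['e', 'd', 'c', 'b', 'a']
    print(reverse_program(list("generalization")))  # same program, longer input

Running reverse_program on inputs of any length returns the reversed sequence; the property the conjecture highlights is exactly this: one short, length-independent program solves the task at every input length.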

[1] Pranjal Awasthi et al. Improving Length-Generalization in Transformers via Task Hinting, 2023, arXiv.

[2] Eran Malach. Auto-Regressive Next-Token Predictors are Universal Learners, 2023, arXiv.

[3] Boyuan Chen et al. Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks, 2023, arXiv.

[4] Max Tegmark et al. The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks, 2023, arXiv.

[5] Siva Reddy et al. The Impact of Positional Encoding on Length Generalization in Transformers, 2023, NeurIPS.

[6] Ronan Le Bras et al. Faith and Fate: Limits of Transformers on Compositionality, 2023, NeurIPS.

[7] Mehdi Abbana Bennani et al. Randomized Positional Encodings Boost Length Generalization of Transformers, 2023, ACL.

[8] Seyed Mehran Kazemi et al. Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples, 2023, NeurIPS.

[9] Michael Hanna et al. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, 2023, arXiv.

[10] J. Steinhardt et al. Progress measures for grokking via mechanistic interpretability, 2023, ICLR.

[11] Tom McGrath et al. Tracr: Compiled Transformers as a Laboratory for Interpretability, 2023, NeurIPS.

[12] D. Schuurmans et al. What learning algorithm is in-context learning? Investigations with linear models, 2022, ICLR.

[13] P. Blunsom et al. Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions, 2022, ACL.

[14] Noah A. Smith et al. Measuring and Narrowing the Compositionality Gap in Language Models, 2022, EMNLP.

[15] Tom B. Brown et al. In-context Learning and Induction Heads, 2022, arXiv.

[16] Aman Madaan et al. Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango, 2022, arXiv.

[17] M. Shanahan et al. Faithful Reasoning Using Large Language Models, 2022, arXiv.

[18] Percy Liang et al. What Can Transformers Learn In-Context? A Case Study of Simple Function Classes, 2022, NeurIPS.

[19] Yuhuai Wu et al. Exploring Length Generalization in Large Language Models, 2022, NeurIPS.

[20] Yuhuai Wu et al. Solving Quantitative Reasoning Problems with Language Models, 2022, NeurIPS.

[21] D. Schuurmans et al. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, 2022, ICLR.

[22] Andrew M. Dai et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.

[23] Peter A. Cholak et al. Overcoming a Theoretical Limitation of Self-Attention, 2022, ACL.

[24] Dale Schuurmans et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models, 2022, NeurIPS.

[25] François Charton. Linear algebra with transformers, 2021, Trans. Mach. Learn. Res.

[26] David Bieber et al. Show Your Work: Scratchpads for Intermediate Computation with Language Models, 2021, arXiv.

[27] Mohammad Bavarian et al. Training Verifiers to Solve Math Word Problems, 2021, arXiv.

[28] Yejin Choi et al. Symbolic Brittleness in Sequence Models: on Systematic Generalization in Symbolic Mathematics, 2021, AAAI.

[29] Noah A. Smith et al. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, 2021, ICLR.

[30] Wojciech Zaremba et al. Evaluating Large Language Models Trained on Code, 2021, arXiv.

[31] Noah A. Smith et al. Saturated Transformers are Constant-Depth Threshold Circuits, 2021, TACL.

[32] Eran Yahav et al. Thinking Like Transformers, 2021, ICML.

[33] Charles Blundell et al. Neural algorithmic reasoning, 2021, Patterns.

[34] Rodrigo Nogueira et al. Investigating the Limitations of Transformers with Simple Arithmetic Tasks, 2021, arXiv:2102.13019.

[35] Wei Zhang et al. How Can Self-Attention Networks Recognize Dyck-n Languages?, 2020, Findings of EMNLP.

[36] Navin Goyal et al. On the Ability and Limitations of Transformers to Recognize Formal Languages, 2020, EMNLP.

[37] Marc van Zee et al. Compositional Generalization in Semantic Parsing: Pre-training vs. Specialized Architectures, 2020, arXiv.

[38] Nikolaos Pappas et al. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, 2020, ICML.

[39] Mark Chen et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[40] Michael Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models, 2019, TACL.

[41] Lukasz Kaiser et al. Attention is All you Need, 2017, NeurIPS.

[42] Lukasz Kaiser et al. Neural GPUs Learn Algorithms, 2015, ICLR.

[43] Sanjeev Arora et al. Computational Complexity: A Modern Approach, 2009.

[44] Generalization, 1984.

[45] Ray J. Solomonoff. A Formal Theory of Inductive Inference. Part II, 1964, Inf. Control.

[46] James L. McClelland et al. Representations and Computations in Transformers that Support Generalization on Structured Tasks, 2023.

[47] P. Barceló et al. Attention is Turing-Complete, 2021, J. Mach. Learn. Res.

[48] Sung-Hyon Myaeng et al. Have You Seen That Number? Investigating Extrapolation in Question Answering Models, 2021, EMNLP.

[49] G. Eijk. Algorithmic reasoning, 2020.

[50] S. Shalev-Shwartz et al. Understanding Machine Learning - From Theory to Algorithms, 2014.

[51] A. Shiryayev. On Tables of Random Numbers, 1993.