Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions

Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions and demonstrate the following: (i) random Transformers are relatively more biased towards functions of low sensitivity; (ii) when trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity; (iii) on sparse Boolean functions, which have low sensitivity, Transformers generalize near perfectly even in the presence of noisy labels, whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong, quantifiable evidence of differences in the inductive biases of Transformers and recurrent models, which may help explain Transformers' effective generalization despite their relatively limited expressiveness.
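To make the two central quantities concrete, the sketch below (in Python, not taken from the paper; all function and parameter names are illustrative assumptions) estimates the average sensitivity of a Boolean function and builds a sparse Boolean target, here a parity on k of n bits, with optional label noise of the kind used in point (iii). The sensitivity of f at an input x is the number of single-bit flips of x that change f(x); a k-sparse parity has sensitivity exactly k at every input, which is why it counts as a low-sensitivity function.

```python
# Minimal sketch, assuming illustrative names; not the authors' experimental code.
import random

def sensitivity_at(f, x):
    """Number of single-bit flips of x that change f(x)."""
    base = f(x)
    count = 0
    for i in range(len(x)):
        flipped = list(x)
        flipped[i] ^= 1
        if f(tuple(flipped)) != base:
            count += 1
    return count

def average_sensitivity(f, n, num_samples=10_000, seed=0):
    """Monte-Carlo estimate of E_x[sensitivity of f at x] over uniform x in {0,1}^n."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_samples):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        total += sensitivity_at(f, x)
    return total / num_samples

def sparse_parity(relevant_bits):
    """A k-sparse Boolean function: parity over a fixed subset of coordinates."""
    def f(x):
        return sum(x[i] for i in relevant_bits) % 2
    return f

def noisy_dataset(f, n, size, noise_rate=0.05, seed=0):
    """Uniform inputs labelled by f, each label flipped independently with prob. noise_rate."""
    rng = random.Random(seed)
    data = []
    for _ in range(size):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        y = f(x)
        if rng.random() < noise_rate:
            y ^= 1
        data.append((x, y))
    return data

if __name__ == "__main__":
    n, k = 20, 3
    f = sparse_parity(relevant_bits=[0, 1, 2])
    # Flipping any of the k relevant bits changes the parity, so the
    # estimate should come out close to k = 3, far below the maximum n = 20.
    print(average_sensitivity(f, n))
    # 1000 noisy training examples for the sparse target in setting (iii).
    print(len(noisy_dataset(f, n, size=1000, noise_rate=0.05)))
```

Datasets of this form (low-sensitivity targets, uniform inputs, a small fraction of flipped labels) are the setting in which the abstract reports that Transformers generalize near perfectly while LSTMs overfit.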
