Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions
[1] S. Kakade, et al. Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit, 2022, NeurIPS.
[2] Yuhuai Wu, et al. Exploring Length Generalization in Large Language Models, 2022, NeurIPS.
[3] Pedro A. Ortega, et al. Neural Networks and the Chomsky Hierarchy, 2022, ICLR.
[4] Ashish Sabharwal, et al. What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment, 2022, EMNLP.
[5] D. Angluin, et al. Formal Language Recognition by Hard Attention Transformers: Perspectives from Circuit Complexity, 2022, Transactions of the Association for Computational Linguistics.
[6] Peter A. Cholak, et al. Overcoming a Theoretical Limitation of Self-Attention, 2022, ACL.
[7] Yuri Burda, et al. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, 2022, ArXiv.
[8] Benjamin L. Edelman, et al. Inductive Biases and Variable Creation in Self-Attention Mechanisms, 2021, ICML.
[9] Noah A. Smith, et al. Saturated Transformers are Constant-Depth Threshold Circuits, 2021, Transactions of the Association for Computational Linguistics.
[10] Uzi Vishkin, et al. Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks, 2021, NeurIPS.
[11] C. Papadimitriou, et al. Self-Attention Networks Can Process Bounded Hierarchical Languages, 2021, ACL.
[12] Dan Jurafsky, et al. Sensitivity as a Complexity Measure for Sequence Classification Tasks, 2021, Transactions of the Association for Computational Linguistics.
[13] Marek Rei, et al. Memorisation versus Generalisation in Pre-trained Language Models, 2021, ACL.
[14] Samy Bengio, et al. Understanding deep learning (still) requires rethinking generalization, 2021, Commun. ACM.
[15] Navin Goyal, et al. On the Practical Ability of Recurrent Neural Networks to Recognize Hierarchical Languages, 2020, COLING.
[16] Wei Zhang, et al. How Can Self-Attention Networks Recognize Dyck-n Languages?, 2020, Findings of EMNLP.
[17] Navin Goyal, et al. On the Ability and Limitations of Transformers to Recognize Formal Languages, 2020, EMNLP.
[18] Guillermo Valle Pérez, et al. Is SGD a Bayesian sampler? Well, almost, 2020, J. Mach. Learn. Res.
[19] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[20] Andrew Tomkins, et al. Choppy: Cut Transformer for Ranked List Truncation, 2020, SIGIR.
[21] Alexander J. Smola, et al. TraDE: Transformers for Density Estimation, 2020, ArXiv.
[22] Noah A. Smith, et al. A Formal Hierarchy of RNN Architectures, 2020, ACL.
[23] Pavel Izmailov, et al. Bayesian Deep Learning and a Probabilistic Perspective of Generalization, 2020, NeurIPS.
[24] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.
[25] Guillermo Valle Pérez, et al. Neural networks are a priori biased towards Boolean functions with low entropy, 2019, ArXiv.
[26] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[27] Michael Hahn, et al. Theoretical Limitations of Self-Attention in Neural Sequence Models, 2019, Transactions of the Association for Computational Linguistics.
[28] Samuel A. Korsky, et al. On the Computational Power of RNNs, 2019, ArXiv.
[29] Zohar Ringel, et al. Learning Curves for Deep Neural Networks: A Gaussian Field Theory Perspective, 2019, ArXiv.
[30] Yonatan Belinkov, et al. LSTM Networks Can Perform Dynamic Counting, 2019, Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges.
[31] Fred Zhang, et al. SGD on Neural Networks Learns Functions of Increasing Complexity, 2019, NeurIPS.
[32] Seth Lloyd, et al. Deep neural networks are biased towards simple functions, 2018, ArXiv.
[33] Samet Oymak, et al. Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?, 2018, ICML.
[34] Yonatan Belinkov, et al. On Evaluating the Generalization of LSTM Models in Formal Languages, 2018, ArXiv.
[35] Dietrich Klakow, et al. Closing Brackets with Recurrent Neural Networks, 2018, BlackboxNLP@EMNLP.
[36] Robert C. Berwick, et al. Evaluating the Ability of LSTMs to Learn Context-Free Grammars, 2018, BlackboxNLP@EMNLP.
[37] Alexander M. Rush, et al. The Annotated Transformer, 2018.
[38] Yoshua Bengio, et al. On the Spectral Bias of Neural Networks, 2018, ICML.
[39] Chico Q. Camargo, et al. Deep learning generalizes because the parameter-function map is biased towards simple functions, 2018, ICLR.
[40] Eran Yahav, et al. On the Practical Computational Power of Finite Precision RNNs for Language Recognition, 2018, ACL.
[41] Jascha Sohl-Dickstein, et al. Sensitivity and Generalization in Neural Networks: an Empirical Study, 2018, ICLR.
[42] Xue Liu, et al. A Comparative Study of Rule Extraction for Recurrent Neural Networks, 2018, ArXiv.
[43] Jeffrey Pennington, et al. Deep Neural Networks as Gaussian Processes, 2017, ICLR.
[44] Yoshua Bengio, et al. A Closer Look at Memorization in Deep Networks, 2017, ICML.
[45] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[46] Andy R. Terrel, et al. SymPy: Symbolic computing in Python, 2017, PeerJ Preprints.
[47] Rocco A. Servedio, et al. Smooth Boolean Functions are Easy: Efficient Algorithms for Low-Sensitivity Functions, 2015, ITCS.
[48] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[49] Andris Ambainis, et al. Tighter Relations between Sensitivity and Other Complexity Measures, 2014, ICALP.
[50] Christopher Potts, et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, 2013, EMNLP.
[51] Ameet Talwalkar, et al. Foundations of Machine Learning, 2012, Adaptive Computation and Machine Learning.
[52] Christopher Potts, et al. Learning Word Vectors for Sentiment Analysis, 2011, ACL.
[53] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.
[54] Leonardo Franco, et al. Generalization ability of Boolean functions implemented in feedforward neural networks, 2006, Neurocomputing.
[55] Jürgen Schmidhuber, et al. LSTM recurrent networks learn simple context-free and context-sensitive languages, 2001, IEEE Trans. Neural Networks.
[56] John F. Kolen, et al. Field Guide to Dynamical Recurrent Networks, 2001.
[57] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[58] Srimat T. Chakradhar, et al. First-order versus second-order single-layer recurrent neural networks, 1994, IEEE Trans. Neural Networks.
[59] Michael Kearns, et al. Efficient noise-tolerant learning from statistical queries, 1993, STOC.
[60] George Cybenko, et al. Approximation by superpositions of a sigmoidal function, 1989, Math. Control. Signals Syst.
[61] Kurt Hornik, et al. Multilayer feedforward networks are universal approximators, 1989, Neural Networks.
[62] Nathan Linial, et al. The influence of variables on Boolean functions, 1988, Proceedings of the 29th Annual Symposium on Foundations of Computer Science.
[63] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[64] T. Sanders, et al. Analysis of Boolean Functions, 2012, ArXiv.
[65] Brendan T. O'Connor, et al. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, 2011.