Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers

A common lens for theoretically studying neural network architectures is to analyze the functions they can approximate. However, constructions from approximation theory may be unrealistic and therefore less meaningful. For example, a common unrealistic trick is to encode target function values using infinite precision. To address these issues, this work proposes a formal definition of statistically meaningful (SM) approximation, which requires the approximating network to exhibit good statistical learnability. We study SM approximation for two function classes: Boolean circuits and Turing machines. We show that overparameterized feedforward neural nets can SM approximate Boolean circuits with sample complexity depending only polynomially on the circuit size, not on the size of the network. In addition, we show that transformers can SM approximate Turing machines whose computation time is bounded by T, with sample complexity polynomial in the alphabet size, the state-space size, and log(T). We also introduce new tools for analyzing generalization that yield much tighter sample complexities than the typical VC-dimension or norm-based bounds; these tools may be of independent interest.
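
To make the notion concrete, the following is a minimal PAC-style sketch of what an SM-approximation requirement could look like; it is an illustrative formalization rather than the paper's verbatim definition, and the symbols F (target class), H (hypothesis class), A (learner), P (input distribution), the loss l, and the sample-complexity function n(eps, delta) are assumptions introduced here for exposition.

% Illustrative sketch (assumed notation, not the paper's exact definition):
% SM approximation asks for more than the existence of a close hypothesis;
% it asks that such a hypothesis be recoverable from finitely many samples.
\[
\begin{aligned}
&\text{$\mathcal{H}$ SM-approximates $\mathcal{F}$ with sample complexity } n(\epsilon,\delta)
\text{ if there is a learner } \mathcal{A} \text{ with } \mathcal{A}(S) \in \mathcal{H} \text{ such that} \\
&\forall f \in \mathcal{F},\ \forall \text{ distributions } P,\ \forall \epsilon,\delta \in (0,1):\quad
\Pr_{S \sim P^{\,n(\epsilon,\delta)}}\!\Big[\,
  \mathbb{E}_{x \sim P}\big[\ell\big(\mathcal{A}(S)(x),\, f(x)\big)\big] \le \epsilon
\,\Big] \ge 1 - \delta .
\end{aligned}
\]

By contrast, a classical universal-approximation statement only asserts that some h in H is close to f, without requiring that such an h be found from a number of samples polynomial in the natural problem parameters (circuit size, or alphabet size, state-space size, and log(T) above).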
