The Connection Between Approximation, Depth Separation and Learnability in Neural Networks

Several recent works have shown separation results between deep neural networks and hypothesis classes with inferior approximation capacity, such as shallow networks or kernel classes. On the other hand, the fact that deep networks can efficiently express a target function does not imply that this target function can be learned efficiently by deep neural networks. In this work we study the intricate connection between learnability and approximation capacity. We show that the learnability of a target function with deep networks depends on the ability of simpler classes to approximate it. Specifically, we show that a necessary condition for a function to be learnable by gradient descent on deep neural networks is that the function can be approximated, at least in a weak sense, by shallow neural networks. We also show that a class of functions can be learned by an efficient statistical query algorithm if and only if it can be approximated in a weak sense by some kernel class. We give several examples of functions that demonstrate depth separation, and conclude that they cannot be learned efficiently, even by a hypothesis class that can efficiently approximate them.
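
To make "approximated in a weak sense" concrete, the following is a minimal sketch of one standard formalization from the approximation and learning literature (the paper's precise definition may differ): a class weakly approximates a target if some bounded-norm member of the class has non-negligible correlation with the target under the input distribution,

\[
\mathcal{H} \ \text{$\epsilon$-weakly approximates } f \ \text{under } \mathcal{D}
\quad \Longleftrightarrow \quad
\sup_{h \in \mathcal{H},\ \|h\|_{L_2(\mathcal{D})} \le 1}
\ \mathbb{E}_{x \sim \mathcal{D}}\big[\, h(x)\, f(x) \,\big] \ \ge\ \epsilon .
\]

Under this reading, the statements above say that if every shallow network (respectively, every efficiently computable kernel class) achieves only negligible correlation with the target, then gradient descent on deep networks (respectively, any efficient statistical query algorithm) cannot learn it efficiently, even when deep networks can express it exactly.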
