Optimization-Based Separations for Neural Networks

Depth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior approximation capabilities. However, there are no known results in which the deeper architecture leverages this advantage into a provable optimization guarantee. We prove that when the data are generated by a distribution with radial symmetry satisfying some mild assumptions, gradient descent can efficiently learn ball indicator functions using a depth-2 neural network with two layers of sigmoidal activations, where the hidden layer is held fixed throughout training. By building on and refining existing techniques for approximation lower bounds of neural networks with a single layer of non-linearities, we show that there are d-dimensional radial distributions on the data such that ball indicators cannot be learned efficiently by any algorithm to accuracy better than Ω(d^{-4}), nor by a standard gradient descent implementation to accuracy better than a constant. These results establish what is, to the best of our knowledge, the first optimization-based separations where the approximation benefits of the stronger architecture provably manifest in practice. Our proof technique introduces new tools and ideas that may be of independent interest in the theoretical study of both the approximation and optimization of neural networks.
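To make the training setup concrete, below is a minimal sketch of the kind of architecture and procedure the abstract describes: a depth-2 network with two layers of sigmoidal activations whose hidden layer is frozen, trained by plain gradient descent to fit a ball indicator on radially symmetric inputs. All specifics here (Gaussian inputs, width, step size, radius, random frozen weights) are illustrative assumptions for exposition, not the paper's construction or parameters.

```python
# Illustrative sketch only: train the output layer of a frozen-hidden-layer,
# two-sigmoid-layer network on a ball-indicator target over radially symmetric data.
import numpy as np

rng = np.random.default_rng(0)
d, width, n, steps, lr = 10, 200, 2000, 500, 0.5
radius = np.sqrt(d)  # assumed target radius; the label is 1{||x|| <= radius}

# Radially symmetric data (standard Gaussian) and ball-indicator labels.
X = rng.standard_normal((n, d))
y = (np.linalg.norm(X, axis=1) <= radius).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fixed (untrained) hidden layer: random weights and biases, sigmoidal activation.
W = rng.standard_normal((width, d)) / np.sqrt(d)
b = rng.standard_normal(width)
H = sigmoid(X @ W.T + b)  # hidden representation, held fixed throughout training

# Trainable output layer, followed by a second sigmoidal activation.
v = np.zeros(width)
c = 0.0

for t in range(steps):
    pred = sigmoid(H @ v + c)                  # network output in (0, 1)
    grad_out = (pred - y) * pred * (1 - pred)  # gradient of the halved squared loss through the outer sigmoid
    v -= lr * (H.T @ grad_out) / n
    c -= lr * grad_out.mean()

print("final squared loss:", np.mean((sigmoid(H @ v + c) - y) ** 2))
```

Since only the output-layer parameters (v, c) are updated while the hidden layer stays fixed, the optimization is over a single layer of weights even though the network applies two layers of sigmoids.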
