Learning One-hidden-layer ReLU Networks via Gradient Descent

We study the problem of learning one-hidden-layer neural networks with the Rectified Linear Unit (ReLU) activation function, where the inputs are drawn from the standard Gaussian distribution and the outputs are generated by a noisy teacher network. We analyze the performance of gradient descent for training such networks via empirical risk minimization and provide algorithm-dependent guarantees. In particular, we prove that tensor initialization followed by gradient descent converges to the ground-truth parameters at a linear rate, up to some statistical error. To the best of our knowledge, this is the first work characterizing recovery guarantees for practical learning of one-hidden-layer ReLU networks with multiple neurons. Numerical experiments verify our theoretical findings.
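The teacher-student setup described above (Gaussian inputs, a noisy ReLU teacher network, gradient descent on the empirical risk) can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it assumes a squared loss, unit second-layer weights, hand-picked sizes (d = 10 inputs, k = 5 hidden neurons), and a random perturbation of the ground truth as a stand-in for the tensor initialization, which is not implemented here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
d, k, n = 10, 5, 5000                        # input dimension, hidden neurons, samples

# Teacher network: y = sum_j relu(w_j^T x) + noise, with Gaussian inputs.
W_star = rng.normal(size=(k, d))             # ground-truth (teacher) weights
X = rng.normal(size=(n, d))                  # inputs x_i ~ N(0, I_d)
y = relu(X @ W_star.T).sum(axis=1) + 0.01 * rng.normal(size=n)

# Initialize near the ground truth (placeholder for the paper's tensor initialization).
W = W_star + 0.1 * rng.normal(size=(k, d))

lr = 0.1
for _ in range(2000):
    pred = relu(X @ W.T).sum(axis=1)         # student network output
    residual = pred - y
    # Gradient of the empirical squared loss (1/2n) * sum_i (pred_i - y_i)^2 w.r.t. W.
    grad = ((residual[:, None] * (X @ W.T > 0)).T @ X) / n
    W -= lr * grad

print("relative parameter error:", np.linalg.norm(W - W_star) / np.linalg.norm(W_star))
```

Under these assumptions the iterates stay close to the teacher weights and the parameter error shrinks toward the noise level, mirroring the linear convergence up to statistical error stated in the abstract.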
