Learning One-hidden-layer ReLU Networks via Gradient Descent

We study the problem of learning one-hidden-layer neural networks with the Rectified Linear Unit (ReLU) activation function, where the inputs are drawn from the standard Gaussian distribution and the outputs are generated by a noisy teacher network. We analyze the performance of gradient descent for training such networks via empirical risk minimization and provide algorithm-dependent guarantees. In particular, we prove that tensor initialization followed by gradient descent converges to the ground-truth parameters at a linear rate, up to some statistical error. To the best of our knowledge, this is the first work characterizing recovery guarantees for practical learning of one-hidden-layer ReLU networks with multiple neurons. Numerical experiments verify our theoretical findings.
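The teacher-student setup described above (Gaussian inputs, a noisy ReLU teacher network, gradient descent on the empirical risk) can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it assumes a squared loss, unit second-layer weights, hand-picked sizes (d = 10 inputs, k = 5 hidden neurons), and a random perturbation of the ground truth as a stand-in for the tensor initialization, which is not implemented here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
d, k, n = 10, 5, 5000                        # input dimension, hidden neurons, samples

# Teacher network: y = sum_j relu(w_j^T x) + noise, with Gaussian inputs.
W_star = rng.normal(size=(k, d))             # ground-truth (teacher) weights
X = rng.normal(size=(n, d))                  # inputs x_i ~ N(0, I_d)
y = relu(X @ W_star.T).sum(axis=1) + 0.01 * rng.normal(size=n)

# Initialize near the ground truth (placeholder for the paper's tensor initialization).
W = W_star + 0.1 * rng.normal(size=(k, d))

lr = 0.1
for _ in range(2000):
    pred = relu(X @ W.T).sum(axis=1)         # student network output
    residual = pred - y
    # Gradient of the empirical squared loss (1/2n) * sum_i (pred_i - y_i)^2 w.r.t. W.
    grad = ((residual[:, None] * (X @ W.T > 0)).T @ X) / n
    W -= lr * grad

print("relative parameter error:", np.linalg.norm(W - W_star) / np.linalg.norm(W_star))
```

Under these assumptions the iterates stay close to the teacher weights and the parameter error shrinks toward the noise level, mirroring the linear convergence up to statistical error stated in the abstract.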
