Learning ReLU Networks via Alternating Minimization

We propose and analyze a new family of algorithms for training neural networks with ReLU activations. Our algorithms are based on the technique of alternating minimization: estimating the activation pattern of each ReLU on all given samples, interleaved with weight updates via a least-squares step. The main focus of our paper is one-hidden-layer networks with $k$ hidden neurons and ReLU activation. We show that, under standard distributional assumptions on the $d$-dimensional input data, our algorithm provably recovers the ground-truth parameters at a linear rate of convergence, provided the weights are sufficiently well initialized; moreover, our method requires only $n=\widetilde{O}(dk^2)$ samples. We also analyze the special case of one-hidden-layer networks with skip connections, commonly used in ResNet-type architectures, and propose a novel initialization strategy for such networks. For ReLU-based ResNet-type networks, we provide the first linear convergence guarantee with an end-to-end algorithm. Finally, we extend this framework to deeper networks and empirically demonstrate that it converges to a global minimum.
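
As a rough illustration of the alternating-minimization idea described above, the following NumPy sketch alternates between (i) estimating every neuron's ReLU activation pattern on the samples and (ii) refitting all hidden-layer weights with a single least-squares solve. It assumes, purely for illustration, a one-hidden-layer model whose second-layer weights are fixed at $+1$; the function name `altmin_relu`, the toy data, and all constants are hypothetical and are not taken from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def altmin_relu(X, y, W0, num_iters=20):
    """Fit y ≈ sum_j relu(X @ w_j) by alternating minimization:
    fix the ReLU activation patterns implied by the current weights,
    then refit all hidden-layer weights via one least-squares solve.
    `W0` is the (k, d) initialization; the paper's guarantees assume it
    is sufficiently close to the ground-truth weights."""
    n, d = X.shape
    k = W0.shape[0]
    W = W0.copy()
    for _ in range(num_iters):
        # Step 1: estimate the activation pattern of every neuron on every sample.
        A = (X @ W.T > 0).astype(float)                      # shape (n, k)
        # Step 2: with the patterns fixed, the model is linear in the weights:
        # y_i ≈ sum_j A[i, j] * x_i^T w_j, so stack [A[i, j] * x_i] as features.
        Phi = (A[:, :, None] * X[:, None, :]).reshape(n, k * d)
        w_flat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        W = w_flat.reshape(k, d)
    return W

# Toy usage (illustrative only): Gaussian inputs, planted weights, noiseless labels.
rng = np.random.default_rng(0)
d, k, n = 20, 3, 4000
W_star = rng.standard_normal((k, d))
X = rng.standard_normal((n, d))
y = relu(X @ W_star.T).sum(axis=1)
W_hat = altmin_relu(X, y, W0=W_star + 0.1 * rng.standard_normal((k, d)))
print(np.linalg.norm(relu(X @ W_hat.T).sum(axis=1) - y) / np.linalg.norm(y))
```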
