Learning Two-Layer Residual Networks with Nonparametric Function Estimation by Convex Programming

We focus on learning a two-layer residual neural network with ReLU preactivation (preReLU-TLRN). Suppose the input $\mathbf{x}$ is drawn from a distribution supported on $\mathbb{R}^d$ and the ground-truth generative model is a preReLU-TLRN, given by $$\mathbf{y} = \boldsymbol{B}^\ast\left[\left(\boldsymbol{A}^\ast\mathbf{x}\right)^+ + \mathbf{x}\right]\text{,}$$ where $(\cdot)^+$ denotes the entrywise ReLU and the ground-truth network parameters are $\boldsymbol{A}^\ast \in \mathbb{R}^{d\times d}$, a nonnegative full-rank matrix, and $\boldsymbol{B}^\ast \in \mathbb{R}^{m\times d}$, a full-rank matrix with $m \geq d$. We design layerwise objectives as functionals whose analytic minimizers express the exact ground-truth network in terms of its parameters and nonlinearities. Following this objective landscape, learning a preReLU-TLRN from finite samples can be formulated as convex programming with nonparametric function estimation: for each layer, we first formulate the corresponding empirical risk minimization (ERM) as a convex quadratic program (QP); we then show that the solution space of the QP is equivalently determined by a set of linear inequalities, which can be solved efficiently by linear programming (LP). Experiments demonstrate the robustness and sample efficiency of our method.
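To make the generative model concrete, the following minimal NumPy sketch simulates i.i.d. samples from a preReLU-TLRN. The dimensions `d` and `m`, the sample size `n`, and the Gaussian input distribution are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's code): simulate data from a
# preReLU-TLRN, y = B* [ (A* x)^+ + x ], with assumed sizes d, m and n samples.
rng = np.random.default_rng(0)
d, m, n = 5, 8, 1000                            # assumed dimensions / sample size

A_star = np.abs(rng.standard_normal((d, d)))    # nonnegative; full rank with prob. 1
B_star = rng.standard_normal((m, d))            # full rank with prob. 1, m >= d

X = rng.standard_normal((n, d))                 # inputs x in R^d (assumed Gaussian)
H = np.maximum(X @ A_star.T, 0.0) + X           # hidden layer: (A* x)^+ + x
Y = H @ B_star.T                                # outputs: y = B* [ (A* x)^+ + x ]

print(X.shape, Y.shape)                         # (n, d), (n, m)
```

Given such a sample $(X, Y)$, the layerwise procedure described above would fit each layer by solving a convex program over the empirical risk rather than by gradient descent on the nonconvex joint objective.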
