Understanding Implicit Regularization in Over-Parameterized Nonlinear Statistical Model

We study the implicit regularization phenomenon induced by simple optimization algorithms in over-parameterized nonlinear statistical models. Specifically, we consider both vector and matrix single index models in which the link function is nonlinear and unknown, the signal parameter is either a sparse vector or a low-rank symmetric matrix, and the response variable can be heavy-tailed. To better understand the role of implicit regularization in nonlinear models without excess technicality, we assume that the distribution of the covariates is known a priori. For both the vector and matrix settings, we construct an over-parameterized least-squares loss function by employing the score function transform and a robust truncation step designed specifically for heavy-tailed data. We propose to estimate the true parameter by applying regularization-free gradient descent to this loss function. When the initialization is close to the origin and the stepsize is sufficiently small, we prove that the obtained solution achieves minimax optimal statistical rates of convergence in both the vector and matrix cases. In particular, for the vector single index model with Gaussian covariates, our proposed estimator is shown to enjoy the oracle statistical rate. Our results capture the implicit regularization phenomenon in over-parameterized nonlinear and noisy statistical models with possibly heavy-tailed data.
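
The construction is stated abstractly above; the following is a minimal sketch of the vector (sparse) case, under assumptions not spelled out in the abstract: standard Gaussian covariates (so the first-order score transform is simply S(x) = x), a clipping-type truncation of the response at a level tau, and a Hadamard-product over-parameterization beta = u ⊙ u − v ⊙ v, which is one standard way to realize implicit sparse regularization with plain gradient descent. All variable names, the link function, and the tuning values below are illustrative, not taken from the paper.

```python
import numpy as np

# Hedged sketch (not the authors' exact construction): regularization-free
# gradient descent on an over-parameterized least-squares loss for a sparse
# single index model y = f(<x, beta*>) + noise with Gaussian covariates.

def truncate(y, tau):
    """Clip responses to [-tau, tau]; a simple robustification for heavy tails."""
    return np.clip(y, -tau, tau)

def overparam_gd(X, y, tau=10.0, alpha=1e-6, eta=1e-2, n_iter=5000):
    n, d = X.shape
    y_t = truncate(y, tau)              # robust truncation step
    u = alpha * np.ones(d)              # initialize near the origin
    v = alpha * np.ones(d)
    for _ in range(n_iter):
        beta = u * u - v * v            # Hadamard-product parameterization
        r = X @ beta - y_t              # residuals of the surrogate LS loss
        g = X.T @ r / n                 # gradient with respect to beta
        u -= eta * 2.0 * g * u          # chain rule through u ⊙ u
        v += eta * 2.0 * g * v          # chain rule through v ⊙ v
    return u * u - v * v

# Toy usage: sparse signal, unknown nonlinear link f(t) = t + 0.5*sin(t).
rng = np.random.default_rng(0)
n, d, s = 500, 200, 5
beta_star = np.zeros(d)
beta_star[:s] = 1.0 / np.sqrt(s)
X = rng.standard_normal((n, d))
z = X @ beta_star
y = z + 0.5 * np.sin(z) + 0.1 * rng.standard_normal(n)
beta_hat = overparam_gd(X, y)
# By Stein's identity, beta_hat estimates beta* up to the unknown scale
# mu = E[f'(<x, beta*>)], since the link is not observed.
```

In this sketch the small initialization alpha and the small constant stepsize eta play the role of the conditions described above: coordinates on the support of beta* grow quickly under gradient descent, while the remaining coordinates stay near zero, so no explicit sparsity penalty is needed.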
