Understanding Implicit Regularization in Over-Parameterized Single Index Model

In this paper, we leverage over-parameterization to design regularization-free algorithms for the high-dimensional single index model and provide theoretical guarantees for the induced implicit regularization phenomenon. Specifically, we study both vector and matrix single index models where the link function is nonlinear and unknown, the signal parameter is either a sparse vector or a low-rank symmetric matrix, and the response variable can be heavy-tailed. To gain a better understanding of the role played by implicit regularization without excess technicality, we assume that the distribution of the covariates is known a priori. For both the vector and matrix settings, we construct an over-parameterized least-squares loss function by employing the score function transform and a robust truncation step designed specifically for heavy-tailed data. We propose to estimate the true parameter by applying regularization-free gradient descent to the loss function. When the initialization is close to the origin and the stepsize is sufficiently small, we prove that the obtained solution achieves minimax optimal statistical rates of convergence in both the vector and matrix cases. In addition, our experimental results support our theoretical findings and also demonstrate that our methods empirically outperform classical methods with explicit regularization in terms of both the ℓ2 statistical rate and variable selection consistency.
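The following is a minimal sketch of the kind of regularization-free procedure the abstract describes, for the sparse-vector case only (the matrix case is analogous). It is not the paper's exact construction; the illustrative assumptions are Gaussian covariates (so the score function is simply S(x) = x), a clipping-based truncation of the response at a level tau, the Hadamard-type over-parameterization beta = w * w - v * v, and a hand-picked early-stopping iteration.

```python
import numpy as np

# Sketch under the stated assumptions; not the paper's exact estimator.
rng = np.random.default_rng(0)

# Synthetic sparse single index model y = f(<x, beta*>) + noise, f unknown.
n, d, s = 500, 200, 5
beta_star = np.zeros(d)
beta_star[:s] = 1.0 / np.sqrt(s)
X = rng.standard_normal((n, d))
y = np.tanh(X @ beta_star) + 0.1 * rng.standard_normal(n)

# Robust truncation step for (possibly) heavy-tailed responses.
tau = 2.0 * np.sqrt(np.log(n))
y_trunc = np.clip(y, -tau, tau)

# Score-function transform: for Gaussian x, (1/n) sum_i y_i * S(x_i) with
# S(x) = x estimates mu * beta* up to the unknown scale mu = E[f'(<x, beta*>)].
t_hat = (X * y_trunc[:, None]).mean(axis=0)

# Over-parameterized least squares L(w, v) = 0.5 * ||w*w - v*v - t_hat||^2,
# minimized by plain gradient descent with no explicit penalty.
def gradients(w, v):
    r = w * w - v * v - t_hat
    return 2.0 * r * w, -2.0 * r * v

alpha = 1e-6      # initialization close to the origin
eta = 0.1         # small stepsize
n_iter = 500      # stopping time inside the "implicit sparsity" window
w = alpha * np.ones(d)
v = alpha * np.ones(d)
for _ in range(n_iter):
    gw, gv = gradients(w, v)
    w -= eta * gw
    v -= eta * gv

beta_hat = w * w - v * v

# Only the direction of beta* is identifiable (f and mu are unknown), so we
# report the cosine similarity and the spurious mass off the true support.
cosine = beta_hat @ beta_star / (np.linalg.norm(beta_hat) * np.linalg.norm(beta_star))
print(f"cosine similarity with beta*: {cosine:.3f}")
print(f"largest off-support coefficient: {np.max(np.abs(beta_hat[s:])):.2e}")
```

The design choice that produces the implicit regularization in this sketch is the combination of a near-origin initialization, a small stepsize, and a suitable stopping time: coordinates outside the true support grow only multiplicatively from the tiny initial value alpha and remain negligible for many iterations, while the support coordinates converge geometrically, so stopping inside that window yields a sparse estimate without any explicit penalty.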
