Asymptotics of Ridge Regression in Convolutional Models

Understanding the generalization and estimation error of estimators for simple models, such as linear and generalized linear models, has attracted considerable attention recently. This is in part due to an interesting observation made in the machine learning community: highly over-parameterized neural networks achieve zero training error and yet generalize well on test samples. This phenomenon is captured by the so-called double descent curve, where the generalization error starts decreasing again after the interpolation threshold. A series of recent works has tried to explain this phenomenon for simple models. In this work, we analyze the asymptotics of the estimation error of ridge estimators for convolutional linear models. These convolutional inverse problems, also known as deconvolution, arise naturally in fields such as seismology, imaging, and acoustics, among others. Our results hold for a large class of input distributions that include i.i.d. features as a special case. We derive exact formulae for the estimation error of ridge estimators that hold in a certain high-dimensional regime. We demonstrate the double descent phenomenon in our experiments for convolutional models and show that our theoretical results match the experiments.
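A minimal sketch (not the authors' code) of the kind of experiment described above: ridge regression for a convolutional linear model (deconvolution), with the estimation error traced across the over-parameterization ratio p/n. The signal length n, noise level sigma, ridge penalty lam, Gaussian i.i.d. input, and number of trials are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_design(x, n, p):
    """n-by-p Toeplitz matrix A with A[i, j] = x[i + p - 1 - j], so that
    A @ theta equals the 'valid' part of the linear convolution of the
    input signal x (length n + p - 1) with a length-p filter theta."""
    return np.stack([x[i + p - 1 :: -1][:p] for i in range(n)])

def ridge_estimation_error(n, p, sigma=0.5, lam=1e-3):
    """Squared estimation error ||theta_hat - theta*||^2 of the ridge estimator."""
    x = rng.standard_normal(n + p - 1)                  # i.i.d. Gaussian input signal
    A = conv_design(x, n, p)                            # convolutional measurement matrix
    theta_star = rng.standard_normal(p) / np.sqrt(p)    # unknown filter, E||theta*||^2 = 1
    y = A @ theta_star + sigma * rng.standard_normal(n)
    # Ridge estimator: theta_hat = (A^T A + lam I)^{-1} A^T y
    theta_hat = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y)
    return np.sum((theta_hat - theta_star) ** 2)

# Sweep the over-parameterization ratio p/n at a small ridge penalty: the error
# peaks near the interpolation threshold p = n and then descends again.
n = 200
for ratio in (0.25, 0.5, 0.9, 1.0, 1.1, 2.0, 4.0):
    p = int(ratio * n)
    err = np.mean([ridge_estimation_error(n, p) for _ in range(20)])
    print(f"p/n = {ratio:4.2f}   estimation error = {err:.3f}")
```

With a near-zero penalty the ridge estimator approaches the least-squares/min-norm interpolator, which is where the peak at p = n and the subsequent descent are most visible; increasing lam smooths the peak away.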
