A theory of high dimensional regression with arbitrary correlations between input features and target functions: sample complexity, multiple descent curves and a hierarchy of phase transitions

The performance of neural networks depends on precise relationships between four distinct ingredients: the architecture, the loss function, the statistical structure of the inputs, and the ground-truth target function. Much theoretical work has focused on understanding the role of the first two ingredients under highly simplified models of random, uncorrelated data and target functions. In contrast, performance likely relies on a conspiracy between the statistical structure of the input distribution and the structure of the function to be learned. To understand this better, we revisit ridge regression in high dimensions, which corresponds to an exceedingly simple architecture and loss function, but we analyze its performance under arbitrary correlations between input features and the target function. We find a rich mathematical structure that includes: (1) a dramatic reduction in sample complexity when the target function aligns with data anisotropy; (2) the existence of multiple descent curves; and (3) a sequence of phase transitions in the performance, loss landscape, and optimal regularization as a function of the amount of data, which explains the first two effects.

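To make the setup concrete, below is a minimal numerical sketch, not the exact model analyzed in the paper: ridge regression on anisotropic Gaussian inputs, comparing a target function aligned with a high-variance input direction against one confined to a low-variance direction, at the same fixed sample size. The dimension, covariance spectrum, and ridge strength are illustrative assumptions; under them, the aligned target is typically learned far more accurately from the same amount of data, illustrating the sample-complexity reduction described in point (1).

```python
# Minimal sketch (illustrative assumptions, not the paper's exact theory):
# ridge regression on anisotropic Gaussian inputs, with the target either
# aligned with a strong (high-variance) input direction or with a weak one.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, lam = 200, 50, 2000, 1e-2

# Diagonal anisotropic covariance: 10 strong directions, the rest weak.
eigvals = np.concatenate([np.full(10, 10.0), np.full(d - 10, 0.1)])
sqrt_cov = np.sqrt(eigvals)  # inputs = standard Gaussians scaled per coordinate

def relative_test_error(w_star):
    """Fit ridge regression on n_train noiseless samples and return
    test MSE divided by the target variance (fraction of unexplained variance)."""
    X_tr = rng.standard_normal((n_train, d)) * sqrt_cov
    X_te = rng.standard_normal((n_test, d)) * sqrt_cov
    y_tr, y_te = X_tr @ w_star, X_te @ w_star
    w_hat = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
    return np.mean((X_te @ w_hat - y_te) ** 2) / np.mean(y_te ** 2)

# Target aligned with a dominant input direction ...
w_aligned = np.zeros(d); w_aligned[0] = 1.0
# ... versus a target living entirely in a weak direction.
w_misaligned = np.zeros(d); w_misaligned[-1] = 1.0

print("aligned target,    relative test error:", relative_test_error(w_aligned))
print("misaligned target, relative test error:", relative_test_error(w_misaligned))
```

Sweeping n_train in this sketch would trace out the kind of learning curves whose multiple-descent structure and data-dependent phase transitions the paper characterizes analytically.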