A fast point solver for deep nonlinear function approximators

Deep kernel processes (DKPs) generalise Bayesian neural networks, but do not require us to represent either features or weights. Instead, at each hidden layer they represent and optimize a flexible kernel. Here, we develop a Newton-like method for DKPs that converges in around 10 steps, exploiting matrix solvers initially developed in the control-theory literature. These are many times faster than the usual gradient-descent approach. We generalise to arbitrary DKP architectures by developing “kernel backprop” and algorithms for “kernel autodiff”. While these methods are currently not Bayesian, as they give point estimates, and scale poorly, being cubic in the number of datapoints, we hope they will form the basis of a new class of much more efficient approaches to optimizing deep nonlinear function approximators.
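For readers unfamiliar with the control-theory solvers alluded to above, the sketch below illustrates the classic setting: a Newton (Kleinman-style) iteration for a continuous-time algebraic Riccati equation, in which each Newton step reduces to a linear Lyapunov/Sylvester matrix solve and typically converges in a handful of iterations. This is only an illustrative analogy, not the paper's DKP solver; the matrices A, B, Q, R and the dimensions are made-up placeholders.

# Illustrative sketch (not the paper's method): Newton/Kleinman iteration
# for a continuous-time algebraic Riccati equation, where each Newton step
# is a Lyapunov solve handled by a standard control-theory matrix solver.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

rng = np.random.default_rng(0)
n, m = 5, 2
A = -np.eye(n) + 0.1 * rng.standard_normal((n, n))  # stable A, so K = 0 is a valid starting gain
B = rng.standard_normal((n, m))
Q = np.eye(n)
R = np.eye(m)

K = np.zeros((m, n))  # stabilizing initial feedback gain
for step in range(10):  # Kleinman's iteration converges quadratically, so ~10 steps is ample
    Acl = A - B @ K
    # Newton step: solve the Lyapunov equation  Acl^T P + P Acl = -(Q + K^T R K)
    P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
    K = np.linalg.solve(R, B.T @ P)  # updated gain for the next iterate

# Sanity check against SciPy's direct Riccati solver.
P_direct = solve_continuous_are(A, B, Q, R)
print(np.max(np.abs(P - P_direct)))  # should be near machine precision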
