Deep kernel processes

We define deep kernel processes, in which positive definite Gram matrices are progressively transformed by nonlinear kernel functions and by sampling from (inverse) Wishart distributions. Remarkably, we find that deep Gaussian processes (DGPs), Bayesian neural networks (BNNs), infinite BNNs, and infinite BNNs with bottlenecks can all be written as deep kernel processes. For DGPs the equivalence arises because the Gram matrix formed by the inner product of features is Wishart distributed and, as we show, standard isotropic kernels can be written entirely in terms of this Gram matrix: no knowledge of the underlying features is needed. We define a tractable deep kernel process, the deep inverse Wishart process, and give a doubly stochastic inducing-point variational inference scheme that operates on the Gram matrices rather than on the features as in DGPs. We show that the deep inverse Wishart process outperforms DGPs and infinite BNNs on standard fully connected baselines.
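As a concrete illustration of the feature-free kernel computation described above, the following NumPy sketch (our own construction, not code from the paper) draws features F whose columns are i.i.d. N(0, K), so that the Gram matrix G = F F^T / N is Wishart distributed up to scaling, and checks that a squared-exponential kernel evaluated on the features agrees with the same kernel written purely in terms of G. The variable names, the choice of kernel, and the covariance K are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

P, N = 5, 1000                    # number of data points, feature width
K = np.eye(P) + 0.5               # an arbitrary positive-definite input covariance

# Features: P x N matrix whose N columns are i.i.d. draws from N(0, K).
F = rng.multivariate_normal(np.zeros(P), K, size=N).T

# Gram matrix; F F^T ~ Wishart(K, N), so G is Wishart distributed up to scaling.
G = F @ F.T / N

# Squared-exponential kernel written in terms of the features ...
sqdist_feat = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1) / N
K_from_features = np.exp(-0.5 * sqdist_feat)

# ... and equivalently in terms of the Gram matrix alone:
# ||f_i - f_j||^2 / N = G_ii + G_jj - 2 G_ij.
diag = np.diag(G)
sqdist_gram = diag[:, None] + diag[None, :] - 2 * G
K_from_gram = np.exp(-0.5 * sqdist_gram)

print(np.allclose(K_from_features, K_from_gram))  # True
```

Because G_ii + G_jj - 2 G_ij is exactly the (scaled) squared distance between features i and j, any isotropic kernel of that distance can be evaluated from the Gram matrix directly, which is what allows a deep kernel process to propagate Gram matrices rather than features.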
