Sparse Gaussian process approximations and applications

Many tasks in machine learning require learning some kind of input-output relation (a function), for example recognising handwritten digits (from image to number) or learning the motion of a dynamical system such as a pendulum (from current positions and velocities to future ones). We consider this problem in the Bayesian framework, where probability distributions represent the state of uncertainty of a learning agent. In particular, we investigate methods that use Gaussian processes to represent distributions over functions. Gaussian process models require approximations in order to be practically useful. This thesis focuses on understanding existing approximations and on investigating new ones tailored to specific applications. We first advance the understanding of existing techniques through a thorough review. We propose desiderata for approximations to non-parametric basis function models, which we use to assess the existing approximations. Following this, we perform an in-depth empirical investigation of two popular approximations (VFE and FITC). Based on the insights gained, we propose a new inter-domain Gaussian process approximation, which can be used to increase the sparsity of the approximation compared to regular inducing point approximations. This allows GP models to be stored and communicated more compactly. Next, we show that inter-domain approximations can also enable models that would otherwise be impractical, rather than only improving approximations to existing ones. We introduce an inter-domain approximation for the Convolutional Gaussian process, a model that makes Gaussian processes suitable for image inputs and that has strong relations to convolutional neural networks. The same technique is also valuable for approximating Gaussian processes with more general invariance properties. Finally, we revisit the derivation of the Gaussian process State Space Model and discuss some subtleties relating to its approximation. We hope that this thesis illustrates some benefits of non-parametric models and of approximating them in a non-parametric fashion, and that it provides models and approximations that will prove useful for the development of more complex and performant models in the future.
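To make the inducing-point idea behind VFE concrete, the sketch below shows a minimal NumPy implementation of a Titsias-style variational lower bound for sparse GP regression with a squared-exponential kernel. This is an illustrative sketch under my own assumptions, not code from the thesis: the helper names (`rbf`, `vfe_bound`), the kernel settings, and the toy data are all made up for the example, and the bound is computed with dense solves for readability rather than the Cholesky-based implementation one would use in practice.

```python
import numpy as np


def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel matrix between two input sets."""
    sqdist = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)


def vfe_bound(X, y, Z, noise_variance=0.1, **kern):
    """Variational lower bound (VFE) for sparse GP regression with inducing inputs Z.

    Written with dense solves for clarity; a practical implementation would
    use Cholesky factorisations and scale as O(N M^2) rather than O(N^3).
    """
    N, M = X.shape[0], Z.shape[0]
    Kuu = rbf(Z, Z, **kern) + 1e-6 * np.eye(M)           # inducing-point covariance (jittered)
    Kuf = rbf(Z, X, **kern)                               # cross-covariance
    Qff = Kuf.T @ np.linalg.solve(Kuu, Kuf)               # Nystrom approximation of Kff
    kff_diag = np.full(N, kern.get("variance", 1.0))      # RBF prior variance on the diagonal
    Sigma = Qff + noise_variance * np.eye(N)

    _, logdet = np.linalg.slogdet(Sigma)
    quad = y @ np.linalg.solve(Sigma, y)
    trace = np.sum(kff_diag - np.diag(Qff)) / noise_variance  # penalises badly placed Z

    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad + trace)


# Toy usage: 200 noisy sine observations summarised by 10 inducing inputs.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
Z = np.linspace(-3.0, 3.0, 10)[:, None]
print(vfe_bound(X, y, Z, noise_variance=0.01))
```

The final trace term penalises the mismatch between the exact covariance and its low-rank approximation, and its presence or absence is one of the behavioural differences between VFE and FITC that the empirical comparison described in the abstract examines.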
