Learning Invariances using the Marginal Likelihood

Generalising well in supervised learning tasks relies on correctly extrapolating the training data to a large region of the input space. One way to achieve this is to constrain the predictions to be invariant to transformations of the input that are known to be irrelevant (e.g. translation). Commonly, this is done through data augmentation, where the training set is enlarged by applying hand-crafted transformations to the inputs. We argue that invariances should instead be incorporated into the model structure and learned using the marginal likelihood, which correctly rewards the reduced complexity of invariant models. We demonstrate this for Gaussian process models, owing to the ease with which their marginal likelihood can be estimated. Our main contribution is a variational inference scheme for Gaussian processes containing invariances described by a sampling procedure. We learn the sampling procedure by back-propagating through it to maximise the marginal likelihood.
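The core construction can be illustrated with a small, self-contained sketch. This is not the authors' implementation: it assumes PyTorch, 2-D inputs, and a hypothetical nuisance transformation (a rotation by an angle drawn uniformly from [-theta_max, theta_max], with theta_max learnable). An invariant kernel is estimated by Monte Carlo averaging of an RBF base kernel over sampled transformations, the sampling is reparameterised so gradients flow into theta_max, and the exact GP regression marginal likelihood is maximised in place of the paper's scalable variational bound.

```python
# Minimal sketch, not the authors' code: learn a rotation invariance by
# back-propagating through sampled transformations to maximise the exact
# GP regression marginal likelihood.
import torch

def rbf(X, Z, lengthscale, variance):
    # Squared-exponential base kernel between two sets of points.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return variance * torch.exp(-0.5 * d2 / lengthscale ** 2)

def sample_rotations(X, theta_max, n_samples):
    # Rotate each input by reparameterised random angles so that gradients
    # flow back into the transformation parameter theta_max.
    eps = torch.rand(n_samples, X.shape[0])        # U(0, 1) noise
    theta = (2.0 * eps - 1.0) * theta_max          # angles in [-theta_max, theta_max]
    c, s = torch.cos(theta), torch.sin(theta)
    R = torch.stack([torch.stack([c, -s], -1),
                     torch.stack([s, c], -1)], -2)  # (S, N, 2, 2) rotation matrices
    return torch.einsum('snij,nj->sni', R, X)       # (S, N, 2) transformed inputs

def invariant_kernel(X, params, n_samples=5):
    # Monte Carlo estimate of the invariant kernel
    #   k_inv(x, x') = E[ k(a(x), a'(x')) ],
    # obtained by averaging the base kernel over sampled transformations.
    S, N = n_samples, X.shape[0]
    Xa = sample_rotations(X, params['theta_max'], S).reshape(S * N, 2)
    K = rbf(Xa, Xa, params['log_lengthscale'].exp(), params['log_variance'].exp())
    return K.reshape(S, N, S, N).mean(dim=(0, 2))   # average over sample pairs

# Toy data: noisy observations of a rotation-invariant function (the radius).
torch.manual_seed(0)
X = torch.randn(40, 2)
y = X.norm(dim=1, keepdim=True) + 0.05 * torch.randn(40, 1)

params = {'theta_max': torch.tensor(0.1, requires_grad=True),
          'log_lengthscale': torch.tensor(0.0, requires_grad=True),
          'log_variance': torch.tensor(0.0, requires_grad=True)}
log_noise = torch.tensor(-2.3, requires_grad=True)  # noise std ~ 0.1
opt = torch.optim.Adam(list(params.values()) + [log_noise], lr=0.05)

for step in range(200):
    opt.zero_grad()
    K = invariant_kernel(X, params) + log_noise.exp() ** 2 * torch.eye(X.shape[0])
    # Negative log marginal likelihood of GP regression, up to a constant:
    #   0.5 * y^T K^{-1} y + 0.5 * log|K|
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y, L)
    nlml = 0.5 * (y * alpha).sum() + torch.log(torch.diagonal(L)).sum()
    nlml.backward()
    opt.step()

print('learned theta_max:', params['theta_max'].item())
```

On this toy task the marginal likelihood rewards enlarging theta_max, since averaging over rotations reduces model complexity without hurting the data fit. The paper's actual scheme replaces the exact marginal likelihood used above with a sparse variational bound, which is what makes the approach scale and allows the sampling procedure itself to be learned within approximate inference.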
