Kernel Identification Through Transformers

Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models, as the chosen kernel determines both the model's inductive biases and the support of functions under the GP prior. This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models. Drawing inspiration from recent progress in deep learning, we introduce a novel approach named KITT: Kernel Identification Through Transformers. KITT exploits a transformer-based architecture to generate kernel recommendations in under 0.1 seconds, which is several orders of magnitude faster than conventional kernel search algorithms. We train our model using synthetic data generated from priors over a vocabulary of known kernels. By exploiting the structure of the self-attention mechanism, KITT is able to process datasets whose inputs have arbitrary dimension. We demonstrate that kernels chosen by KITT yield strong performance across a diverse collection of regression benchmarks.
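The abstract states that training data are generated synthetically from priors over a vocabulary of known kernels. Below is a minimal sketch of how such (dataset, kernel-label) training pairs could be produced; the kernel vocabulary, functional forms, and hyperparameter ranges shown here are illustrative assumptions, not KITT's actual configuration.

```python
import numpy as np

# Hypothetical kernel vocabulary; the forms and hyperparameter ranges below
# are illustrative, not the vocabulary or priors used by KITT.
def rbf(x1, x2, lengthscale=1.0):
    d = x1[:, None, :] - x2[None, :, :]
    return np.exp(-0.5 * np.sum(d ** 2, axis=-1) / lengthscale ** 2)

def periodic(x1, x2, lengthscale=1.0, period=1.0):
    d = np.sqrt(np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1))
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

def linear(x1, x2, variance=1.0):
    return variance * x1 @ x2.T

VOCAB = {"RBF": rbf, "Periodic": periodic, "Linear": linear}

def sample_task(n=64, dim=2, rng=None):
    """Draw one synthetic (dataset, kernel-label) pair from a GP prior."""
    rng = rng or np.random.default_rng()
    name = rng.choice(list(VOCAB))
    X = rng.uniform(-1.0, 1.0, size=(n, dim))
    K = VOCAB[name](X, X) + 1e-6 * np.eye(n)     # jitter for numerical stability
    y = rng.multivariate_normal(np.zeros(n), K)  # function values under the GP prior
    return X, y, name

X, y, label = sample_task()
print(label, X.shape, y.shape)
```

In this sketch, the pairs (X, y) would serve as inputs to the transformer and the kernel name as the classification target; the input dimension `dim` can vary across tasks, consistent with the claim that self-attention lets the model handle inputs of arbitrary dimension.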