Frequentist coverage and sup-norm convergence rate in Gaussian process regression

Gaussian process (GP) regression is a powerful interpolation technique due to its flexibility in capturing nonlinearity. In this paper, we provide a general framework for understanding the frequentist coverage of pointwise and simultaneous Bayesian credible sets in GP regression. As an intermediate result, we develop a Bernstein–von Mises type result under the supremum norm in random-design GP regression. Identifying both the mean and covariance functions of the posterior distribution of the Gaussian process as regularized $M$-estimators, we show that the sampling distribution of the posterior mean function and the centered posterior distribution can each be approximated by a population-level GP. By developing a comparison inequality between two GPs, we provide an exact characterization of the frequentist coverage probabilities of Bayesian pointwise credible intervals and simultaneous credible bands for the regression function. Our results show that inference based on GP regression tends to be conservative: when the prior is under-smoothed, the resulting credible intervals and bands have minimax-optimal sizes, with their frequentist coverage converging to a non-degenerate value between their nominal level and one. As a byproduct of our theory, we show that GP regression also yields a minimax-optimal posterior contraction rate relative to the supremum norm, which provides positive evidence for the long-standing open problem of the optimal supremum-norm contraction rate in GP regression.
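
To make the objects studied above concrete, the following is a minimal, self-contained sketch (not the paper's code) of GP regression under a random design: it forms the posterior mean, the pointwise posterior variance, and the resulting nominal 95% credible interval, and then checks the interval's frequentist coverage at a fixed point by Monte Carlo. The squared-exponential kernel, the true regression function, the noise level, and all tuning values below are illustrative assumptions, not choices taken from the paper.

```python
# Minimal sketch of pointwise credible intervals in GP regression and a
# Monte Carlo check of their frequentist coverage. All settings here
# (kernel, true function f0, noise level, sample size) are illustrative.
import numpy as np

def sq_exp_kernel(x, y, length_scale=0.2):
    """Squared-exponential covariance k(x, y) for scalar inputs."""
    d = x[:, None] - y[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var, length_scale=0.2):
    """Posterior mean and pointwise variance under a zero-mean GP prior."""
    K = sq_exp_kernel(x_train, x_train, length_scale)
    K_star = sq_exp_kernel(x_test, x_train, length_scale)
    A = K + noise_var * np.eye(len(x_train))           # K + sigma^2 I
    alpha = np.linalg.solve(A, y_train)
    mean = K_star @ alpha                               # posterior mean at x_test
    v = np.linalg.solve(A, K_star.T)
    var = np.diag(sq_exp_kernel(x_test, x_test, length_scale)) \
        - np.sum(K_star * v.T, axis=1)                  # pointwise posterior variance
    return mean, np.maximum(var, 0.0)

def f0(x):
    """A smooth 'true' regression function used only for this illustration."""
    return np.sin(2 * np.pi * x) + 0.5 * x

rng = np.random.default_rng(0)
n, noise_sd, n_rep = 200, 0.5, 500
x_star = np.array([0.5])                                # point at which coverage is checked
covered = 0
for _ in range(n_rep):
    x = rng.uniform(0.0, 1.0, n)                        # random design
    y = f0(x) + noise_sd * rng.normal(size=n)
    mean, var = gp_posterior(x, y, x_star, noise_var=noise_sd ** 2)
    half = 1.96 * np.sqrt(var)                          # nominal 95% credible interval
    covered += int(mean[0] - half[0] <= f0(x_star)[0] <= mean[0] + half[0])
print(f"empirical coverage of the 95% pointwise credible interval: {covered / n_rep:.3f}")
```

In line with the conservativeness described above, such a simulation typically returns an empirical coverage at or above the nominal level when the prior smoothness does not exceed that of the true function; the precise limiting value depends on the prior and the regularization, which is what the paper characterizes.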
