Gaussian processes for machine learning

Gaussian processes (GPs) provide a principled, practical, probabilistic approach to learning in kernel machines. GPs have received growing attention in the machine learning community over the past decade, and this book provides a long-needed, systematic and unified treatment of their theoretical and practical aspects in machine learning. The treatment is comprehensive and self-contained, and is aimed at researchers and students in machine learning and applied statistics. The book deals with the supervised learning problem for both regression and classification, and includes detailed algorithms. A wide variety of covariance (kernel) functions are presented and their properties discussed. Model selection is treated from both a Bayesian and a classical perspective. Many connections to other well-known techniques from machine learning and statistics are drawn, including support vector machines, neural networks, splines, regularization networks and relevance vector machines. Theoretical issues including learning curves and the PAC-Bayesian framework are treated, and several approximation methods for learning with large datasets are discussed. The book contains illustrative examples and exercises, and code and datasets can be obtained on the web. Appendices provide mathematical background and a discussion of Gaussian Markov processes.
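To make the regression setting mentioned above concrete, here is a minimal sketch of zero-mean GP regression with a squared-exponential covariance function, written with NumPy. It is not taken from the book or its accompanying code; the function names (`sq_exp_kernel`, `gp_predict`), the hyperparameter values and the toy sine data are all assumptions chosen for illustration.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b.

    Hypothetical helper; hyperparameter defaults are illustrative only.
    """
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and latent variance of a zero-mean GP at x_test."""
    # Covariance of the noisy training targets.
    K = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    # Cross-covariance between training and test inputs.
    K_s = sq_exp_kernel(x_train, x_test)
    # Solve K alpha = y via a Cholesky factorisation for numerical stability.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    # Predictive variance of the latent function: prior minus explained part.
    v = np.linalg.solve(L, K_s)
    var = np.diag(sq_exp_kernel(x_test, x_test)) - np.sum(v**2, axis=0)
    return mean, var

# Toy data (assumed for illustration): noise-free samples of sin(x).
x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 3.0, 7)

mean, var = gp_predict(x_train, y_train, x_test)
for x, m, s2 in zip(x_test, mean, var):
    print(f"x = {x:+.1f}  mean = {m:+.3f}  var = {s2:.3f}")
```

The predictive variance should be small near the training inputs and grow towards the prior variance away from them, which is the behaviour that distinguishes a GP from a plain kernel smoother.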
