Sparse Gaussian processes for large-scale machine learning

Gaussian Processes (GPs) are non-parametric Bayesian models that achieve state-of-the-art performance in supervised learning tasks such as non-linear regression and classification, and are therefore used as building blocks for more sophisticated machine learning applications. GPs also enjoy a number of other desirable properties: they are virtually overfitting-free, have sound and convenient model selection procedures, and provide so-called “error bars”, i.e., estimates of the uncertainty of their predictions. Unfortunately, full GPs cannot be applied directly to real-world, large-scale data sets because of their high computational cost: for n data samples, training a GP requires O(n³) computation time, which puts databases with more than a few thousand instances beyond the reach of modern desktop computers. Several sparse approximations that scale linearly with the number of data samples have recently been proposed, with the Sparse Pseudo-input GP (SPGP) representing the current state of the art. Sparse GP approximations can handle large databases but do not usually match the performance of full GPs.

In this thesis we present several novel sparse GP models that compare favorably with SPGP, both in predictive performance and in the quality of their error bars. Our models converge to the full GP under some conditions, but our goal is not so much to faithfully approximate full GPs as to develop useful models that provide high-quality probabilistic predictions; in doing so, they occasionally outperform even full GPs. We provide two broad classes of models: Marginalized Networks (MNs) and Inter-Domain GPs (IDGPs). MNs can be seen as models lying between classical Neural Networks (NNs) and full GPs, aiming to combine the advantages of both. Although they are trained differently, they retain the structure of classical NNs when used for prediction, so they can be interpreted as a novel way to train a classical NN while adding input-dependent error bars and resistance to overfitting. IDGPs generalize SPGP by allowing the “pseudo-inputs” to lie in a different domain, which adds flexibility and improves performance; furthermore, they provide a convenient probabilistic framework in which previous sparse methods can be more easily understood.

All the proposed algorithms are tested and compared with the current state of the art on several standard, large-scale data sets with different properties. Their strengths and weaknesses are also discussed and compared, making it easier to select the best-suited candidate for each potential application.
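To make the scaling difference concrete, below is a minimal sketch (not taken from the thesis) contrasting exact GP regression, which factorizes the full n × n kernel matrix at O(n³) cost, with a simple inducing-point approximation whose dominant cost is O(nm²) for m ≪ n pseudo-inputs. It is written in Python with NumPy under assumed choices: an RBF kernel with fixed hyperparameters and a Subset-of-Regressors-style predictor rather than the SPGP/FITC or inter-domain constructions developed in the thesis; all function names are hypothetical.

import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential (RBF) kernel between the rows of A and B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def full_gp_predict(X, y, Xs, noise=0.1):
    # Exact GP regression: factorizing the n x n kernel matrix costs O(n^3).
    K = rbf(X, X) + noise**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf(Xs, X)
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = rbf(Xs, Xs).diagonal() - np.sum(v**2, axis=0)
    return mean, var

def sparse_gp_predict(X, y, Xs, Z, noise=0.1):
    # Subset-of-Regressors-style sparse approximation with m inducing
    # inputs Z: only m x m systems are solved, so the cost is O(n m^2).
    Kuu = rbf(Z, Z) + 1e-6 * np.eye(len(Z))   # small jitter for stability
    Kuf = rbf(Z, X)
    Kus = rbf(Z, Xs)
    A = Kuu + Kuf @ Kuf.T / noise**2          # m x m system matrix
    mean = Kus.T @ np.linalg.solve(A, Kuf @ y) / noise**2
    var = np.einsum('ij,ji->i', Kus.T, np.linalg.solve(A, Kus))
    return mean, var

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-3.0, 3.0, size=(500, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
    Xs = np.linspace(-3.0, 3.0, 5)[:, None]   # test inputs
    Z = np.linspace(-3.0, 3.0, 20)[:, None]   # m = 20 pseudo-inputs
    print(full_gp_predict(X, y, Xs)[0])       # exact predictive mean
    print(sparse_gp_predict(X, y, Xs, Z)[0])  # approximate predictive mean

With n = 500 training samples and m = 20 pseudo-inputs, the sparse predictor only factorizes m × m matrices, which is the source of the linear-in-n scaling discussed above.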
