An Online Projection Estimator for Nonparametric Regression in Reproducing Kernel Hilbert Spaces.

The goal of nonparametric regression is to recover an underlying regression function from noisy observations, under the assumption that the regression function belongs to a prespecified infinite-dimensional function space. In the online setting, where observations arrive in a stream, it is generally computationally infeasible to refit the whole model repeatedly. As yet, there are no methods that are both computationally efficient and statistically rate-optimal. In this paper, we propose an estimator for online nonparametric regression. Notably, our estimator is an empirical risk minimizer in a deterministic linear space, which is quite different from existing methods that use random features and functional stochastic gradient. Our theoretical analysis shows that this estimator attains a rate-optimal generalization error when the regression function is known to lie in a reproducing kernel Hilbert space. We also show, theoretically and empirically, that the computational cost of our estimator is much lower than that of other rate-optimal estimators proposed for this online setting.
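To make the idea of "an empirical risk minimizer in a deterministic linear space" concrete, the following is a minimal sketch and not the estimator proposed in the paper. It assumes a fixed cosine basis on [0, 1], a fixed number of basis functions, and a small ridge term for numerical stability. The least-squares fit is updated one observation at a time with a rank-one (Sherman-Morrison) update, so the model is never refit from scratch. All names here (`cosine_features`, `OnlineLeastSquares`, `num_basis`, `ridge`) are hypothetical and chosen only for illustration.

```python
import math


def cosine_features(x, num_basis):
    """Evaluate a fixed, deterministic cosine basis at a point x in [0, 1].

    Assumption for illustration: the basis is 1, sqrt(2)*cos(pi*j*x), j >= 1.
    """
    return [1.0] + [
        math.sqrt(2.0) * math.cos(math.pi * j * x) for j in range(1, num_basis)
    ]


class OnlineLeastSquares:
    """Ridge-regularized least squares in the span of a fixed basis,
    updated one observation at a time (recursive least squares)."""

    def __init__(self, num_basis, ridge=1e-3):
        self.num_basis = num_basis
        # inv_gram stores (ridge * I + sum_i phi_i phi_i^T)^{-1}.
        self.inv_gram = [
            [1.0 / ridge if i == j else 0.0 for j in range(num_basis)]
            for i in range(num_basis)
        ]
        self.coef = [0.0] * num_basis

    def predict(self, x):
        phi = cosine_features(x, self.num_basis)
        return sum(c * p for c, p in zip(self.coef, phi))

    def update(self, x, y):
        """Incorporate one observation (x, y) in O(num_basis^2) time."""
        n = self.num_basis
        phi = cosine_features(x, n)
        # p_phi = inv_gram @ phi (inv_gram is symmetric).
        p_phi = [sum(row[j] * phi[j] for j in range(n)) for row in self.inv_gram]
        denom = 1.0 + sum(phi[i] * p_phi[i] for i in range(n))
        residual = y - sum(c * p for c, p in zip(self.coef, phi))
        # Sherman-Morrison rank-one update of the inverse Gram matrix.
        for i in range(n):
            for j in range(n):
                self.inv_gram[i][j] -= p_phi[i] * p_phi[j] / denom
        # Coefficient update; p_phi / denom equals the updated inv_gram @ phi.
        for i in range(n):
            self.coef[i] += p_phi[i] / denom * residual
```

A typical use would be to create `OnlineLeastSquares(num_basis=5)`, call `update(x, y)` for each arriving observation, and call `predict(x)` whenever an estimate is needed. Holding `num_basis` fixed is a simplification made for this sketch; how the dimension of the linear space should be chosen is not specified in the abstract and is left open here.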
