Learning with SGD and Random Features

Sketching and stochastic gradient methods are arguably the most common techniques for deriving efficient large-scale learning algorithms. In this paper, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient descent with mini-batches and random features. The latter can be seen as a form of nonlinear sketching and is used to define approximate kernel methods. The considered estimator is not explicitly penalized or constrained, and regularization is implicit. Indeed, our study highlights how different parameters, such as the number of features, the number of iterations, the step size, and the mini-batch size, control the learning properties of the solutions. We do this by deriving optimal finite-sample bounds under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments.
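
As a minimal illustration of the kind of estimator under study, the sketch below combines random Fourier features (approximating a Gaussian kernel) with mini-batch SGD on the squared loss, with no explicit penalty: regularization is implicit and controlled by the number of features, the number of iterations, the step size, and the mini-batch size. All function names and hyperparameter values are illustrative assumptions, not the paper's implementation.

```python
# Sketch: mini-batch SGD over random Fourier features, no explicit penalty.
# Implicit regularization comes from M (features), T (iterations),
# gamma (step size), and batch_size (mini-batch size).
import numpy as np

def random_fourier_features(X, W, b):
    """Map X of shape (n, d) to (n, M) features approximating a Gaussian kernel."""
    M = W.shape[1]
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

def sgd_random_features(X, y, M=200, T=1000, batch_size=10, gamma=0.5, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Random Fourier feature parameters for a Gaussian kernel of width sigma.
    W = rng.normal(scale=1.0 / sigma, size=(d, M))
    b = rng.uniform(0.0, 2 * np.pi, size=M)
    w = np.zeros(M)  # no explicit penalty or constraint on w
    for _ in range(T):
        idx = rng.integers(0, n, size=batch_size)       # sample a mini-batch
        Phi = random_fourier_features(X[idx], W, b)     # (batch_size, M)
        grad = Phi.T @ (Phi @ w - y[idx]) / batch_size  # gradient of the squared loss
        w -= gamma * grad                               # plain SGD step
    return w, W, b

def predict(Xtest, w, W, b):
    return random_fourier_features(Xtest, W, b) @ w

# Usage on synthetic data (illustrative only):
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 3))
y = np.sin(X.sum(axis=1)) + 0.1 * rng.normal(size=500)
w, W, b = sgd_random_features(X, y)
print("training MSE:", np.mean((predict(X, w, W, b) - y) ** 2))
```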
