Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral Algorithms

We study the generalization properties of distributed algorithms in the setting of nonparametric regression over a reproducing kernel Hilbert space (RKHS). We first investigate distributed stochastic gradient methods (SGM), with mini-batches and multiple passes over the data. We show that optimal generalization error bounds (up to a logarithmic factor) can be retained for distributed SGM provided that the partition level is not too large. We then extend our results to spectral algorithms (SA), including kernel ridge regression (KRR), kernel principal component regression, and gradient methods. Our results show that distributed SGM has a smaller theoretical computational complexity than both distributed KRR and classical SGM. Moreover, even for general non-distributed SA, our analysis yields optimal, capacity-dependent convergence rates in the case where the regression function may not lie in the RKHS.
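To make the distributed SGM scheme concrete, the following is a minimal NumPy sketch (not the authors' implementation): the data are split into partitions, each machine runs multi-pass mini-batch SGM for kernel least squares on its local subset by updating the coefficients of a kernel expansion, and the final estimator averages the local predictors. The function names (`gaussian_kernel`, `local_sgm`, `distributed_sgm`) and all hyper-parameter values (step size, number of passes, batch size, number of partitions) are illustrative assumptions, not notation from the paper.

```python
# Hypothetical sketch of divide-and-conquer multi-pass mini-batch SGM
# for kernel least-squares regression; names and parameters are illustrative.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def local_sgm(X, y, step_size=0.5, n_passes=5, batch_size=8, sigma=1.0, seed=0):
    """Multi-pass mini-batch SGM on one partition.
    The local estimator is f(x) = sum_j alpha[j] * K(X[j], x)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)              # local Gram matrix
    alpha = np.zeros(n)
    for _ in range(n_passes):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            resid = K[idx] @ alpha - y[idx]       # f_t(x_i) - y_i on the mini-batch
            alpha[idx] -= step_size / len(idx) * resid  # functional gradient step
    return alpha

def distributed_sgm(X, y, n_partitions=4, **kwargs):
    """Split the data, run SGM on each partition, average the local estimators."""
    parts = np.array_split(np.arange(X.shape[0]), n_partitions)
    models = [(X[p], local_sgm(X[p], y[p], seed=s, **kwargs))
              for s, p in enumerate(parts)]
    def predict(X_test):
        preds = [gaussian_kernel(X_test, Xp, kwargs.get("sigma", 1.0)) @ a
                 for Xp, a in models]
        return np.mean(preds, axis=0)             # average of local predictions
    return predict

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(400, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(400)
    f_bar = distributed_sgm(X, y, n_partitions=4, step_size=0.5, n_passes=10)
    X_test = np.linspace(-1, 1, 200)[:, None]
    print("test MSE:", np.mean((f_bar(X_test) - np.sin(3 * X_test[:, 0])) ** 2))
```

The per-partition cost is what drives the computational savings: each machine only forms and iterates over its own Gram matrix, and communication reduces to a single averaging step at the end.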
