Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates

We study a decomposition-based scalable approach to kernel ridge regression and show that it achieves minimax optimal convergence rates under relatively mild conditions. The method is simple to describe: it randomly partitions a dataset of size N into m subsets of equal size, computes an independent kernel ridge regression estimator for each subset using a careful choice of the regularization parameter, then averages the local solutions into a global predictor. This partitioning leads to a substantial reduction in computation time versus the standard approach of performing kernel ridge regression on all N samples. Our two main theorems establish that despite the computational speed-up, statistical optimality is retained: as long as m is not too large, the partition-based estimator achieves the statistical minimax rate over all estimators using the set of N samples. As concrete examples, our theory guarantees that the number of subsets m may grow nearly linearly in N for finite-rank or Gaussian kernels and polynomially in N for Sobolev spaces, which in turn allows for substantial reductions in computational cost. We conclude with experiments on both simulated data and a music-prediction task that complement our theoretical results, exhibiting the computational and statistical benefits of our approach.
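To make the procedure concrete, here is a minimal Python sketch of the divide-and-conquer estimator described above. It is illustrative only: function names such as dc_krr are ours, the Gaussian kernel and the fixed regularization level lam are placeholder choices, and the paper's analysis requires a careful, m-aware choice of lam (roughly, regularizing at the level appropriate to the full sample size N rather than the subset size), which the sketch leaves to the caller.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def fit_local_krr(X, y, lam, kernel):
    # Dual solution of kernel ridge regression on one subset:
    #   alpha = (K + n * lam * I)^{-1} y,   f_hat(x) = k(x, X) @ alpha.
    n = len(y)
    K = kernel(X, X)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return X, alpha

def dc_krr(X, y, m, lam, kernel=rbf_kernel, seed=0):
    # Randomly partition the N samples into m subsets of (nearly) equal
    # size and fit an independent KRR estimator on each subset.
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(len(y)), m)
    local_fits = [fit_local_krr(X[idx], y[idx], lam, kernel) for idx in blocks]

    def predict(X_test):
        # The global predictor is the uniform average of the m local ones.
        preds = [kernel(X_test, X_i) @ alpha_i for X_i, alpha_i in local_fits]
        return np.mean(preds, axis=0)

    return predict
```

The source of the speed-up is visible in fit_local_krr: the dual solve costs cubically in the number of samples it sees, so replacing one size-N linear system by m independent size-N/m systems cuts the total work from O(N^3) to O(N^3 / m^2), and to O(N^3 / m^3) per machine when the subsets are processed in parallel.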
