Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems

Kernel Ridge Regression (KRR) is a fundamental method in machine learning. Given an n-by-d data matrix as input, a traditional implementation requires Θ(n²) memory to form the n-by-n kernel matrix and Θ(n³) flops to compute the final model. These time and storage costs prohibit KRR from scaling to large datasets. For example, even on a relatively small dataset (a 520k-by-90 input requiring 357 MB), KRR needs 2 TB of memory just to store the kernel matrix, because n is usually much larger than d in real-world applications. Weak scaling is also a problem: if we keep d and n/p fixed as p grows (p is the number of machines), the memory needed grows as Θ(p) per processor and the flops as Θ(p²) per processor, whereas perfect weak scaling would keep both constant at Θ(1) per processor. The traditional Distributed KRR implementation (DKRR) achieves only 0.32% weak scaling efficiency when scaling from 96 to 1536 processors. We propose two new methods to address these problems: Balanced KRR (BKRR) and K-means KRR (KKRR). Both methods partition the input dataset into p parts in alternative ways, train p different models, and then select the best model among them. Compared to the conventional implementation, KKRR2 (the optimized version of KKRR) improves the weak scaling efficiency from 0.32% to 38% and achieves a 591x speedup for reaching the same accuracy on the same data and hardware (1536 processors). BKRR2 (the optimized version of BKRR) achieves higher accuracy than the current fastest method with less training time on a variety of datasets. For applications requiring only approximate solutions, BKRR2 improves the weak scaling efficiency to 92% and achieves a 3505x speedup (theoretical speedup: 4096x).
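
To make the cost argument and the partition-then-select idea concrete, the sketch below is a minimal, hypothetical NumPy illustration, not the paper's implementation: krr_fit forms the full n-by-n RBF kernel and solves the regularized linear system, which is exactly where the Θ(n²) memory and Θ(n³) flops come from, while partition_train_select trains one independent KRR model per partition and keeps the one with the lowest validation error. The function names, the RBF kernel choice, the (K + λnI) regularization form, and the random balanced split used as a stand-in for BKRR/KKRR partitioning (k-means in the KKRR case) are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(X, Z, gamma=1.0):
    # Gaussian (RBF) kernel between the rows of X and Z.
    sq = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * sq)


def krr_fit(X, y, lam=1e-3, gamma=1.0):
    # Traditional KRR: forming K costs Theta(n^2) memory and the dense
    # solve costs Theta(n^3) flops, which is what prevents scaling.
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)


def krr_predict(X_train, alpha, X_test, gamma=1.0):
    return rbf_kernel(X_test, X_train, gamma) @ alpha


def partition_train_select(X, y, X_val, y_val, p=4, lam=1e-3, gamma=1.0, seed=0):
    # Partition the rows into p parts (a random balanced split here, standing
    # in for BKRR's balanced partitioning or KKRR's k-means partitioning),
    # train one KRR model per part, and keep the model with the lowest
    # validation error. Each local solve costs only Theta((n/p)^3) flops and
    # Theta((n/p)^2) memory instead of Theta(n^3) and Theta(n^2).
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(X.shape[0]), p)
    best_err, best_model = np.inf, None
    for idx in parts:
        alpha = krr_fit(X[idx], y[idx], lam, gamma)
        err = np.mean((krr_predict(X[idx], alpha, X_val, gamma) - y_val) ** 2)
        if err < best_err:
            best_err, best_model = err, (idx, alpha)
    return best_model, best_err
```

In a distributed setting each of the p local models would be trained by its own processor group, so the per-processor memory and flop counts stay roughly constant as p grows, which is the weak-scaling behavior the abstract describes.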
