Distributed Kernel Principal Component Analysis

Kernel Principal Component Analysis (KPCA) is a key technique in machine learning for extracting the nonlinear structure of data and pre-processing it for downstream learning algorithms. We study the distributed setting in which there are multiple workers, each holding a set of points, who wish to compute the principal components of the union of their pointsets. Our main result is a communication-efficient algorithm that takes arbitrary data points as input and computes a set of global principal components, which give a relative-error approximation for polynomial kernels, and a relative-error approximation up to an arbitrarily small additive error for a wide family of kernels including the Gaussian kernel. While recent work shows how to do PCA in a distributed setting, the kernel setting is significantly more challenging. Although the "kernel trick" is useful for efficient computation, it is unclear how to use it to reduce communication. The main problem with previous work is that its communication is proportional to the dimension of the data points, which in the kernel setting corresponds to the dimension of the feature space, or to the number of examples; both can be very large. We instead first select a small subset of points whose span contains a good approximation (the column subset selection problem, CSS), and then use sketching for low-rank approximation to achieve our result. The column subset selection is done using a careful combination of oblivious subspace embeddings for kernels, oblivious leverage score approximation, and adaptive sampling. Together these yield a nearly optimal communication bound for CSS, and also nearly optimal communication for KPCA in the constant-factor approximation regime. Experiments on large-scale real-world datasets demonstrate the efficacy of our algorithm.
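To make the "select a small subset of points, then approximate in its span" idea concrete, here is a minimal, centralized Python sketch with a Gaussian kernel. It is not the distributed protocol from the paper: uniform column sampling stands in for the oblivious subspace embedding / leverage score / adaptive sampling pipeline, feature-space centering is omitted, and everything runs on one machine. The function names and parameters are illustrative assumptions, not the paper's API.

```python
# Simplified, centralized illustration of subset-based approximate KPCA.
# Uniform sampling replaces the paper's careful column subset selection.
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel between rows of X and rows of Y."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-sq / (2.0 * sigma**2))

def subset_kpca(X, k, m, sigma=1.0, seed=0):
    """Approximate top-k kernel principal component scores from an m-point subset.

    Returns an n x k matrix of approximate scores for the n input points,
    computed Nystrom-style from the sampled columns of the kernel matrix.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, replace=False)   # stand-in for column subset selection
    C = gaussian_kernel(X, X[idx], sigma)        # n x m sampled kernel columns
    W = C[idx, :]                                # m x m kernel on the subset
    # Eigendecompose the small subset kernel and lift back to all points.
    evals, evecs = np.linalg.eigh(W)
    order = np.argsort(evals)[::-1][:k]
    evals, evecs = evals[order], evecs[:, order]
    evals = np.maximum(evals, 1e-12)             # guard tiny/negative eigenvalues
    return C @ (evecs / np.sqrt(evals))          # approximate component scores

if __name__ == "__main__":
    # Tiny usage example on synthetic data.
    X = np.random.default_rng(1).normal(size=(1000, 20))
    scores = subset_kpca(X, k=5, m=100, sigma=2.0)
    print(scores.shape)  # (1000, 5)
```

The point of the sketch is only that once a small set of m points is chosen, all further computation (and, in the distributed setting, all further communication) can be expressed through the n x m block of kernel evaluations against that subset rather than the full n x n kernel matrix.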
