Communication Efficient Distributed Kernel Principal Component Analysis

Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear features from data. When a large volume of high-dimensional data is collected in a distributed fashion, it becomes very costly to communicate all of it to a single data center and then perform kernel PCA. Can we perform kernel PCA on the entire dataset in a distributed and communication-efficient fashion while maintaining strong, provable guarantees on solution quality? In this paper, we give an affirmative answer by developing a communication-efficient algorithm for kernel PCA in the distributed setting. The algorithm combines subspace embedding and adaptive sampling techniques; we show that it can take as input an arbitrary configuration of distributed datasets and compute a set of global kernel principal components with relative-error guarantees independent of the dimension of the feature space and of the total number of data points. In particular, computing k principal components with relative error ε over s workers has communication cost Õ(spk/ε + sk²/ε³) words, where p is the average number of nonzero entries in each data point. Furthermore, we evaluated the algorithm on large-scale real-world datasets and showed that it produces a high-quality kernel PCA solution while using significantly less communication than alternative approaches.

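The exact algorithm from the paper is not reproduced here, but the communication pattern the abstract describes (each worker sends a small feature-space summary instead of its raw data, and a coordinator extracts global principal components) can be illustrated with a minimal sketch. The sketch below substitutes random Fourier features for the paper's subspace-embedding and adaptive-sampling steps, so it is an assumption-laden approximation of the idea rather than the authors' method; the function names (random_fourier_features, worker_summary, coordinator_kpca) and parameters (n_features, gamma) are illustrative, and feature-space centering is omitted for brevity.

```python
import numpy as np

def random_fourier_features(X, n_features, gamma, rng):
    # Approximate the RBF-kernel feature map phi(x) with random Fourier features.
    # (Stand-in for the paper's subspace-embedding step; the actual algorithm
    # uses a different embedding with provable relative-error guarantees.)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def worker_summary(X_local, n_features, gamma, seed):
    # Each worker embeds its local shard and ships only a small summary
    # (here: the embedded second-moment matrix), not the raw points.
    rng = np.random.default_rng(seed)  # shared seed => all workers use the same embedding
    Z = random_fourier_features(X_local, n_features, gamma, rng)
    return Z.T @ Z, X_local.shape[0]

def coordinator_kpca(summaries, k):
    # Aggregate the worker summaries and extract the top-k principal
    # directions in the (approximate) kernel feature space.
    C = sum(S for S, _ in summaries)
    n = sum(m for _, m in summaries)
    eigvals, eigvecs = np.linalg.eigh(C / n)
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]

# Toy usage: four "workers", each holding a random local shard.
rng = np.random.default_rng(0)
shards = [rng.normal(size=(500, 20)) for _ in range(4)]
summaries = [worker_summary(X, n_features=256, gamma=0.5, seed=42) for X in shards]
vals, components = coordinator_kpca(summaries, k=5)
print(vals)
```

In this sketch the per-worker communication is one n_features × n_features matrix, independent of the number of local data points, which is the kind of data-size-independent cost the abstract's Õ(spk/ε + sk²/ε³) bound formalizes for the actual algorithm.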