Distributed Stochastic Algorithms for High-rate Streaming Principal Component Analysis

This paper considers the problem of estimating the principal eigenvector of a covariance matrix from independent and identically distributed data samples in streaming settings. In many contemporary applications, the streaming rate of data can be high enough that a single processor cannot finish an iteration of existing eigenvector estimation methods before a new sample arrives. This paper formulates and analyzes a distributed variant of the classical Krasulina's method (D-Krasulina) that can keep up with the high streaming rate by distributing the computational load across multiple processing nodes. The analysis shows that, under appropriate conditions, D-Krasulina converges to the principal eigenvector in an order-wise optimal manner; i.e., after receiving $M$ samples across all nodes, its estimation error can be $O(1/M)$. To reduce the network communication overhead, the paper also develops and analyzes a mini-batch extension of D-Krasulina, termed DM-Krasulina. The analysis shows that DM-Krasulina also achieves order-optimal estimation error rates under appropriate conditions, even when some samples must be discarded within the network due to communication latency. Finally, experiments on synthetic and real-world data validate the convergence behavior of D-Krasulina and DM-Krasulina in high-rate streaming settings.
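To make the updates concrete, below is a minimal NumPy sketch, not the paper's reference implementation, of the classical Krasulina iteration and a serial simulation of the distributed mini-batch variant. The function names, the step-size schedule $\gamma_t = \gamma_0/(c+t)$ with its constants, and the representation of per-node mini-batches are illustrative assumptions; D-Krasulina corresponds to the special case of one sample per node per iteration, and averaging the local directions stands in for the network all-reduce.

```python
import numpy as np

def krasulina_step(w, x, gamma):
    # Classical Krasulina update for the top eigenvector of E[x x^T]:
    # w <- w + gamma * (x x^T w - (w^T x x^T w / ||w||^2) w)
    xw = x @ w                                  # scalar x^T w
    return w + gamma * (xw * x - (xw ** 2 / (w @ w)) * w)

def dm_krasulina(node_batches_per_iter, w0, gamma0=1.0, c=10.0):
    # Serial simulation of DM-Krasulina (illustrative sketch): at iteration t,
    # each "node" holds a mini-batch and computes its local Krasulina
    # direction; the directions are then averaged, mimicking an all-reduce,
    # before a single update of the shared iterate.
    w = w0 / np.linalg.norm(w0)
    for t, batches in enumerate(node_batches_per_iter, start=1):
        gamma = gamma0 / (c + t)                # O(1/t) step size (assumed schedule)
        dirs = []
        for X in batches:                       # X: (B, d) samples at one node
            Xw = X @ w                          # (B,) vector of x_i^T w
            cov_w = X.T @ Xw / len(X)           # batch sample covariance times w
            rayleigh = (Xw @ Xw) / (len(X) * (w @ w))
            dirs.append(cov_w - rayleigh * w)   # local Krasulina direction
        w = w + gamma * np.mean(dirs, axis=0)   # "network average" of directions
    return w / np.linalg.norm(w)

# Toy check on a spiked-covariance stream whose principal eigenvector is e_1.
rng = np.random.default_rng(0)
d, nodes, B, iters = 20, 4, 8, 2000
spike = np.eye(d)[0]
def sample(n):                                  # x = z + u * e_1, z ~ N(0, I), u ~ N(0, 1)
    return rng.standard_normal((n, d)) + rng.standard_normal((n, 1)) * spike
stream = [[sample(B) for _ in range(nodes)] for _ in range(iters)]
w_hat = dm_krasulina(stream, rng.standard_normal(d))
print("alignment with true eigenvector:", abs(w_hat @ spike))
```

On this toy stream, the alignment $|\hat{w}^\top e_1|$ should approach 1 as the total sample count $M$ grows, consistent with the $O(1/M)$ error behavior described above; setting each node's batch size to $B = 1$ recovers the plain D-Krasulina setting.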
