Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector Problems

Motivated by the problem of online canonical correlation analysis, we propose the Stochastic Scaled-Gradient Descent (SSGD) algorithm for minimizing the expectation of a stochastic function over a generic Riemannian manifold. SSGD generalizes the idea of projected stochastic gradient descent and allows the use of scaled stochastic gradients instead of stochastic gradients. In the special case of a spherical constraint, which arises in generalized eigenvector problems, we establish a nonasymptotic finite-sample bound of √(1/T), and show that this rate is minimax optimal up to polylogarithmic factors in the relevant parameters. On the asymptotic side, a novel trajectory-averaging argument allows us to achieve local asymptotic normality with a rate that matches that of Ruppert-Polyak-Juditsky averaging. We bring these ideas together in an application to online canonical correlation analysis, deriving, for the first time in the literature, an optimal one-time-scale algorithm with an explicit rate of local asymptotic convergence to normality. Numerical studies of canonical correlation analysis on synthetic data are also provided.
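To make the core idea concrete, the following is a minimal illustrative sketch of a scaled-gradient iteration on the unit sphere for a streaming generalized eigenvector problem (maximize vᵀAv subject to vᵀBv = 1), which is the special case the abstract highlights. The function names (ssgd_gev, sample_pair), the 1/√t step-size schedule, and the particular scaled-gradient direction are assumptions chosen for illustration, not the paper's exact algorithm or constants; the sketch is only meant to show how scaled gradients avoid matrix inversion and how trajectory averaging is layered on top.

```python
import numpy as np

def ssgd_gev(sample_pair, dim, T, eta0=1.0, seed=0):
    """Illustrative SSGD sketch on the unit sphere for the top
    generalized eigenvector of a matrix pair (A, B).

    `sample_pair()` yields one stochastic pair (A_t, B_t) with
    E[A_t] = A and E[B_t] = B. Step size and averaging scheme are
    illustrative assumptions, not the paper's exact choices.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)      # start on the unit sphere
    avg = np.zeros(dim)         # Ruppert-Polyak-style trajectory average
    for t in range(1, T + 1):
        A_t, B_t = sample_pair()
        # Scaled stochastic ascent direction: an ordinary stochastic
        # gradient step for the generalized Rayleigh quotient would
        # involve B^{-1}; the scaled direction below needs no inversion.
        g = A_t @ v - (v @ A_t @ v) * (B_t @ v)
        eta = eta0 / np.sqrt(t)     # assumed 1/sqrt(t) schedule
        v = v + eta * g
        v /= np.linalg.norm(v)      # project back onto the sphere
        avg += (v - avg) / t        # running average of the trajectory
    return avg / np.linalg.norm(avg)

if __name__ == "__main__":
    # Toy demo: A = diag(3, 2, 1) seen through symmetric noise, B = I,
    # so the target is the ordinary top eigenvector e_1 (up to sign).
    d = 3
    A = np.diag([3.0, 2.0, 1.0])
    rng = np.random.default_rng(1)

    def sample_pair():
        E = rng.standard_normal((d, d))
        A_t = A + 0.5 * (E + E.T)   # noisy symmetric observation of A
        B_t = np.eye(d)             # B = I for the toy example
        return A_t, B_t

    v_hat = ssgd_gev(sample_pair, d, T=20000)
    print(v_hat)
```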
