Stochastic Chebyshev Gradient Descent for Spectral Optimization

A large class of machine learning techniques requires the solution of optimization problems involving spectral functions of parametric matrices, e.g., the log-determinant and the nuclear norm. Unfortunately, computing the gradient of a spectral function is generally of cubic complexity, so gradient descent methods are rather expensive for optimizing objectives involving such functions. Thus, one naturally turns to stochastic gradient methods in the hope that they will provide a way to reduce, or altogether avoid, the computation of full gradients. However, a new challenge appears here: there is no straightforward way to compute unbiased stochastic gradients of spectral functions. In this paper, we develop unbiased stochastic gradients for spectral-sums, an important subclass of spectral functions. Our unbiased stochastic gradients combine randomized trace estimators with stochastic truncation of Chebyshev expansions. A careful design of the truncation distribution allows us to obtain distributions that are variance-optimal, which is crucial for the fast and stable convergence of stochastic gradient methods. We further leverage our proposed stochastic gradients to devise stochastic methods for objective functions involving spectral-sums, and rigorously analyze their convergence rates. The utility of our methods is demonstrated in numerical experiments.
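
The construction sketched above can be made concrete. Below is a minimal sketch (an assumed illustration, not the authors' implementation) of the core estimator: an unbiased estimate of tr(f(A)) obtained by combining a Hutchinson (Rademacher) trace estimator with a randomly truncated Chebyshev expansion of f. It assumes A is symmetric with eigenvalues rescaled into [-1, 1], draws the truncation degree from a geometric distribution rather than the paper's variance-optimal one, and is unbiased with respect to the degree-max_deg Chebyshev interpolant of f; all function names and parameters here are illustrative.

```python
# Minimal sketch (assumption, not the authors' code): unbiased estimate of
# tr(f(A)) via a Hutchinson probe plus a randomly truncated Chebyshev sum.
import numpy as np


def stochastic_trace_estimate(A, f, max_deg=50, p_stop=0.1, rng=None):
    """One-sample estimate whose expectation (over v and N) equals the trace
    of the degree-max_deg Chebyshev interpolant of f applied to A."""
    rng = np.random.default_rng() if rng is None else rng
    d = A.shape[0]

    # Chebyshev interpolation coefficients c_0, ..., c_max_deg of f on [-1, 1].
    c = np.polynomial.chebyshev.chebinterpolate(f, max_deg)

    # Hutchinson probe: E[v v^T] = I, hence E[v^T f(A) v] = tr(f(A)).
    v = rng.choice([-1.0, 1.0], size=d)

    # Random truncation degree N, capped at max_deg for this sketch.
    N = min(int(rng.geometric(p_stop)), max_deg)
    # Survival probabilities Pr[N >= j]; dividing the j-th term by Pr[N >= j]
    # is what keeps the randomly truncated sum unbiased.
    surv = (1.0 - p_stop) ** np.maximum(np.arange(max_deg + 1) - 1, 0)

    # Chebyshev recurrence on vectors: w_j = T_j(A) v.
    w_prev, w_curr = v, A @ v
    est = (c[0] / surv[0]) * (v @ w_prev) + (c[1] / surv[1]) * (v @ w_curr)
    for j in range(2, N + 1):
        w_next = 2.0 * (A @ w_curr) - w_prev
        est += (c[j] / surv[j]) * (v @ w_next)
        w_prev, w_curr = w_curr, w_next
    return est
```

Averaging many independent calls approximates tr(f(A)); for example, with f(x) = log(a + b*x) and A shifted and rescaled accordingly, this recovers a (shifted) log-determinant. Differentiating the same reweighted partial sums with respect to the matrix parameters is what yields the unbiased stochastic gradients described in the abstract.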
