Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates

Stochastic majorization-minimization (SMM) is a class of stochastic optimization algorithms that proceed by sampling new data points and minimizing a recursive average of surrogate functions of an objective function. Until now, the surrogates have been required to be strongly convex, and no convergence rate analysis was available in the general nonconvex setting. In this paper, we propose an extension of SMM in which the surrogates are only required to be weakly convex or block multi-convex, and the averaged surrogates are approximately minimized with proximal regularization or block-minimized within diminishing radii, respectively. For the general nonconvex constrained setting with non-i.i.d. data samples, we show that the first-order optimality gap of the proposed algorithm decays at the rate $O((\log n)^{1+\epsilon}/n^{1/2})$ for the empirical loss and $O((\log n)^{1+\epsilon}/n^{1/4})$ for the expected loss, where $n$ denotes the number of data samples processed. Under an additional assumption, the latter rate improves to $O((\log n)^{1+\epsilon}/n^{1/2})$. As a corollary, we obtain the first convergence rate bounds for various optimization methods in the general nonconvex dependent-data setting: double-averaging projected gradient descent and its generalizations, proximal point empirical risk minimization, and online matrix/tensor decomposition algorithms. We also provide experimental validation of our results.
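To make the recursion concrete, below is a minimal Python sketch of one instance of the proximally regularized SMM iteration: prox-linear (quadratic) surrogates for a smooth loss, averaged with weights $w_n = 1/n$, so that the averaged surrogate plus the proximal term can be minimized in closed form and then projected onto a ball constraint. The function name `smm_prox`, the least-squares example, and all parameter values are our own illustrative choices under these assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def smm_prox(grad, sample, theta0, L=10.0, lam=1.0, n_iters=2000, radius=5.0):
    """Sketch of proximally regularized SMM with prox-linear surrogates.

    At step n, the surrogate of the sampled loss at theta_{n-1} is
        g_n(theta) = const + grad_n^T (theta - theta_{n-1}) + (L/2)||theta - theta_{n-1}||^2,
    and the running average gbar_n = (1 - 1/n) gbar_{n-1} + (1/n) g_n is a
    quadratic (L/2)||theta||^2 - v_n^T theta + const, where v_n averages the
    terms L*theta_{k-1} - grad_k.  Minimizing gbar_n plus the proximal term
    (lam/2)||theta - theta_{n-1}||^2 over a Euclidean ball amounts to
    projecting the closed-form unconstrained minimizer onto the ball.
    """
    theta = theta0.copy()
    v = np.zeros_like(theta)                  # running average of L*theta_{n-1} - grad_n
    for n in range(1, n_iters + 1):
        data = sample()                       # draw a data point (possibly Markovian)
        g = grad(theta, data)                 # stochastic gradient at the anchor point
        w = 1.0 / n                           # balanced averaging weight
        v = (1.0 - w) * v + w * (L * theta - g)
        theta = (v + lam * theta) / (L + lam) # minimizer of averaged surrogate + prox term
        nrm = np.linalg.norm(theta)           # project onto the constraint set (a ball)
        if nrm > radius:
            theta *= radius / nrm
    return theta

# Usage: noisy linear regression with loss l(theta; a, b) = 0.5 * (a @ theta - b)^2.
theta_true = np.array([1.0, -2.0, 0.5])

def sample():
    a = rng.standard_normal(3)
    b = a @ theta_true + 0.1 * rng.standard_normal()
    return a, b

def grad(theta, data):
    a, b = data
    return (a @ theta - b) * a

print(smm_prox(grad, sample, np.zeros(3)))    # approaches theta_true
```

With these particular quadratic surrogates, the update reduces to the double-averaging projected gradient step mentioned above as a corollary; other weakly convex surrogates would require an inner approximate minimization instead of the closed-form step.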
