Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates

Stochastic majorization-minimization (SMM) is a class of stochastic optimization algorithms that proceed by sampling new data points and minimizing a recursive average of surrogate functions of an objective function. Until now, the surrogates have been required to be strongly convex, and no convergence rate analysis was available in the general nonconvex setting. In this paper, we propose an extension of SMM in which the surrogates are only required to be weakly convex or block multi-convex, and the averaged surrogates are approximately minimized with proximal regularization or block-minimized within diminishing radii, respectively. For the general nonconvex constrained setting with non-i.i.d. data samples, we show that the first-order optimality gap of the proposed algorithm decays at the rate $O((\log n)^{1+\epsilon}/n^{1/2})$ for the empirical loss and $O((\log n)^{1+\epsilon}/n^{1/4})$ for the expected loss, where $n$ denotes the number of data samples processed. Under an additional assumption, the latter rate improves to $O((\log n)^{1+\epsilon}/n^{1/2})$. As a corollary, we obtain the first convergence rate bounds for various optimization methods in the general nonconvex dependent-data setting: double-averaging projected gradient descent and its generalizations, proximal point empirical risk minimization, and online matrix/tensor decomposition algorithms. We also provide experimental validation of our results.
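To make the recursion concrete, below is a minimal Python sketch of one instance of the proximally regularized SMM iteration: prox-linear (quadratic) surrogates for a smooth loss, averaged with weights $w_n = 1/n$, so that the averaged surrogate plus the proximal term can be minimized in closed form and then projected onto a ball constraint. The function name `smm_prox`, the least-squares example, and all parameter values are our own illustrative choices under these assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def smm_prox(grad, sample, theta0, L=10.0, lam=1.0, n_iters=2000, radius=5.0):
    """Sketch of proximally regularized SMM with prox-linear surrogates.

    At step n, the surrogate of the sampled loss at theta_{n-1} is
        g_n(theta) = const + grad_n^T (theta - theta_{n-1}) + (L/2)||theta - theta_{n-1}||^2,
    and the running average gbar_n = (1 - 1/n) gbar_{n-1} + (1/n) g_n is a
    quadratic (L/2)||theta||^2 - v_n^T theta + const, where v_n averages the
    terms L*theta_{k-1} - grad_k.  Minimizing gbar_n plus the proximal term
    (lam/2)||theta - theta_{n-1}||^2 over a Euclidean ball amounts to
    projecting the closed-form unconstrained minimizer onto the ball.
    """
    theta = theta0.copy()
    v = np.zeros_like(theta)                  # running average of L*theta_{n-1} - grad_n
    for n in range(1, n_iters + 1):
        data = sample()                       # draw a data point (possibly Markovian)
        g = grad(theta, data)                 # stochastic gradient at the anchor point
        w = 1.0 / n                           # balanced averaging weight
        v = (1.0 - w) * v + w * (L * theta - g)
        theta = (v + lam * theta) / (L + lam) # minimizer of averaged surrogate + prox term
        nrm = np.linalg.norm(theta)           # project onto the constraint set (a ball)
        if nrm > radius:
            theta *= radius / nrm
    return theta

# Usage: noisy linear regression with loss l(theta; a, b) = 0.5 * (a @ theta - b)^2.
theta_true = np.array([1.0, -2.0, 0.5])

def sample():
    a = rng.standard_normal(3)
    b = a @ theta_true + 0.1 * rng.standard_normal()
    return a, b

def grad(theta, data):
    a, b = data
    return (a @ theta - b) * a

print(smm_prox(grad, sample, np.zeros(3)))    # approaches theta_true
```

With these particular quadratic surrogates, the update reduces to the double-averaging projected gradient step mentioned above as a corollary; other weakly convex surrogates would require an inner approximate minimization instead of the closed-form step.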
