Practical optimization methods for machine learning models
[1] Mark W. Schmidt,et al. Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization , 2011, NIPS.
[2] U. Ascher,et al. Adaptive and stochastic algorithms for EIT and DC resistivity problems with piecewise constant solutions and many measurements , 2011 .
[3] Razvan Pascanu,et al. Revisiting Natural Gradient for Deep Networks , 2013, ICLR.
[4] Mark W. Schmidt,et al. A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets , 2012, ArXiv.
[5] Eric P. Xing,et al. Conditional Topic Random Fields , 2010, ICML.
[6] Andrew McCallum,et al. Information extraction from research papers using conditional random fields , 2006, Inf. Process. Manag..
[7] Zeyuan Allen Zhu,et al. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives , 2015, ICML.
[8] Mark W. Schmidt,et al. Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions , 2015, UAI.
[9] Yurii Nesterov,et al. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems , 2012, SIAM J. Optim..
[10] Alexander J. Smola,et al. Variance Reduction in Stochastic Gradient Langevin Dynamics , 2016, NIPS.
[11] Shai Shalev-Shwartz,et al. Accelerated Mini-Batch Stochastic Dual Coordinate Ascent , 2013, NIPS.
[12] Alexander J. Smola,et al. Stochastic Variance Reduction for Nonconvex Optimization , 2016, ICML.
[13] Kevin P. Murphy,et al. Machine learning - a probabilistic perspective , 2012, Adaptive computation and machine learning series.
[14] Karol Gregor,et al. Neural Variational Inference and Learning in Belief Networks , 2014, ICML.
[15] Rong Jin,et al. MixedGrad: An O(1/T) Convergence Rate Algorithm for Stochastic Smooth Optimization , 2013, ArXiv.
[16] Pascal Fua,et al. Kullback-Leibler Proximal Variational Inference , 2015, NIPS.
[17] Matthew D. Hoffman,et al. A trust-region method for stochastic variational inference with applications to streaming data , 2015, ICML.
[18] Phil Blunsom,et al. Semantic Role Labelling with Tree Conditional Random Fields , 2005, CoNLL.
[19] Nuno Vasconcelos,et al. Spatiotemporal Saliency in Dynamic Scenes , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[20] Max Welling,et al. Auto-Encoding Variational Bayes , 2013, ICLR.
[21] Tara N. Sainath,et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .
[22] Peter L. Bartlett,et al. Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks , 2008, J. Mach. Learn. Res..
[23] M E J Newman,et al. Modularity and community structure in networks. , 2006, Proceedings of the National Academy of Sciences of the United States of America.
[24] Léon Bottou,et al. On the Ineffectiveness of Variance Reduced Optimization for Deep Learning , 2018, NeurIPS.
[25] Mark W. Schmidt,et al. Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron , 2018, AISTATS.
[26] Mark W. Schmidt,et al. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition , 2016, ECML/PKDD.
[27] Silvere Bonnabel,et al. Stochastic Gradient Descent on Riemannian Manifolds , 2011, IEEE Transactions on Automatic Control.
[28] Rong Jin,et al. Linear Convergence with Condition Number Independent Access of Full Gradients , 2013, NIPS.
[29] Ambuj Tewari,et al. Composite objective mirror descent , 2010, COLT 2010.
[30] Suvrit Sra,et al. Matrix Manifold Optimization for Gaussian Mixtures , 2015, NIPS.
[31] Martin J. Wainwright,et al. Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes , 2010, J. Mach. Learn. Res..
[32] Demis Hassabis,et al. Mastering the game of Go without human knowledge , 2017, Nature.
[33] Léon Bottou,et al. A Lower Bound for the Optimization of Finite Sums , 2014, ICML.
[34] Suvrit Sra,et al. First-order Methods for Geodesically Convex Optimization , 2016, COLT.
[35] Mark W. Schmidt,et al. Stop Wasting My Gradients: Practical SVRG , 2015, NIPS.
[36] Hiroyuki Kasai,et al. Riemannian stochastic variance reduced gradient on Grassmann manifold , 2016, ArXiv.
[37] Juha Karhunen,et al. Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes , 2010, J. Mach. Learn. Res..
[38] Francis Bach,et al. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives , 2014, NIPS.
[39] Mark W. Schmidt,et al. Hybrid Deterministic-Stochastic Methods for Data Fitting , 2011, SIAM J. Sci. Comput..
[40] Deanna Needell,et al. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm , 2013, Mathematical Programming.
[41] Chong Wang,et al. Stochastic variational inference , 2012, J. Mach. Learn. Res..
[42] Yoram Singer,et al. Pegasos: primal estimated sub-gradient solver for SVM , 2011, Math. Program..
[43] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[44] Mark W. Schmidt,et al. Accelerated training of conditional random fields with stochastic gradient methods , 2006, ICML.
[45] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.
[46] John Wright,et al. Complete dictionary recovery over the sphere , 2015, 2015 International Conference on Sampling Theory and Applications (SampTA).
[47] Alexander J. Smola,et al. A Generic Approach for Escaping Saddle points , 2017, AISTATS.
[48] W. Ziller. Riemannian Manifolds with Positive Sectional Curvature , 2012, 1210.4102.
[49] Ben Taskar,et al. Max-Margin Markov Networks , 2003, NIPS.
[50] D. Bertsekas,et al. Convergence Rate of Incremental Subgradient Algorithms , 2000 .
[51] Tong Zhang,et al. Stochastic Optimization with Importance Sampling for Regularized Loss Minimization , 2014, ICML.
[52] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2) , 1983 .
[53] Jie Liu,et al. SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient , 2017, ICML.
[54] Shakir Mohamed,et al. Variational Inference with Normalizing Flows , 2015, ICML.
[55] Ben Jeuris,et al. A survey and comparison of contemporary algorithms for computing the matrix geometric mean , 2012 .
[56] S. Rosset,et al. Piecewise linear regularized solution paths , 2007, 0708.2197.
[57] Mark W. Schmidt,et al. MASAGA: A Linearly-Convergent Stochastic First-Order Method for Optimization on Manifolds , 2018, ECML/PKDD.
[58] S. Sathiya Keerthi,et al. A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs , 2005, J. Mach. Learn. Res..
[59] Hiroyuki Kasai,et al. Riemannian stochastic variance reduced gradient , 2016, SIAM J. Optim..
[60] Sean Gerrish,et al. Black Box Variational Inference , 2013, AISTATS.
[61] Yurii Nesterov,et al. Introductory Lectures on Convex Optimization - A Basic Course , 2014, Applied Optimization.
[62] Taher H. Haveliwala,et al. Adaptive methods for the computation of PageRank , 2004 .
[63] Suvrit Sra,et al. Fast stochastic optimization on Riemannian manifolds , 2016, ArXiv.
[64] Thorsten Joachims,et al. Making large scale SVM learning practical , 1998 .
[65] Jan Peters,et al. Reinforcement learning in robotics: A survey , 2013, Int. J. Robotics Res..
[66] Saeed Ghadimi,et al. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization , 2013, Mathematical Programming.
[67] Christopher D. Manning,et al. Efficient, Feature-based, Conditional Random Field Parsing , 2008, ACL.
[68] R. Vershynin,et al. A Randomized Kaczmarz Algorithm with Exponential Convergence , 2007, math/0702226.
[69] Boris Polyak,et al. Acceleration of stochastic approximation by averaging , 1992 .
[70] Julien Mairal,et al. Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure , 2016, NIPS.
[71] Sophia Ananiadou,et al. Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty , 2009, ACL.
[72] Teng Zhang,et al. Robust Principal Component Analysis by Manifold Optimization , 2017 .
[73] C. Udriste,et al. Convex Functions and Optimization Methods on Riemannian Manifolds , 1994 .
[74] Eric R. Ziegel,et al. Generalized Linear Models , 2002, Technometrics.
[75] H. Robbins. A Stochastic Approximation Method , 1951 .
[76] Frank Nielsen,et al. Statistical exponential families: A digest with flash cards , 2009, ArXiv.
[77] Ulrich Paquet. On the Convergence of Stochastic Variational Inference in Bayesian Networks , 2014 .
[78] Lin Xiao,et al. A Proximal Stochastic Gradient Method with Progressive Variance Reduction , 2014, SIAM J. Optim..
[79] Thorsten Joachims,et al. KDD-Cup 2004: results and analysis , 2004, SKDD.
[80] Jorge Nocedal,et al. Sample size selection in optimization methods for machine learning , 2012, Math. Program..
[81] Mark W. Schmidt,et al. Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields , 2015, AISTATS.
[82] Shai Shalev-Shwartz,et al. Stochastic dual coordinate ascent methods for regularized loss , 2012, J. Mach. Learn. Res..
[83] Rajeev Motwani,et al. The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.
[84] Saeed Ghadimi,et al. Optimal Stochastic Approximation Algorithms for Strongly Convex Stochastic Composite Optimization I: A Generic Algorithmic Framework , 2012, SIAM J. Optim..
[85] Mark W. Schmidt,et al. Minimizing finite sums with the stochastic average gradient , 2013, Mathematical Programming.
[86] Eric Moulines,et al. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning , 2011, NIPS.
[87] Shai Ben-David,et al. Understanding Machine Learning: From Theory to Algorithms , 2014 .
[88] Tim Salimans,et al. Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression , 2012, ArXiv.
[89] Jason D. M. Rennie,et al. Loss Functions for Preference Levels: Regression with Discrete Ordered Labels , 2005 .
[90] Fernando Pereira,et al. Shallow Parsing with Conditional Random Fields , 2003, NAACL.
[91] Ami Wiesel,et al. Geodesic Convexity and Covariance Estimation , 2012, IEEE Transactions on Signal Processing.
[92] François Yvon,et al. Practical Very Large Scale CRFs , 2010, ACL.
[93] Mark W. Schmidt,et al. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs , 2012, ICML.
[94] Michael Collins,et al. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.
[95] Marc Teboulle,et al. A fast Iterative Shrinkage-Thresholding Algorithm with application to wavelet-based image deblurring , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.
[96] Jie Liu,et al. Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting , 2015, IEEE Journal of Selected Topics in Signal Processing.
[97] Léon Bottou,et al. The Tradeoffs of Large Scale Learning , 2007, NIPS.
[98] Andrew McCallum,et al. Dynamic Conditional Random Fields for Jointly Labeling Multiple Sequences , 2003 .
[99] Levent Tunçel,et al. Optimization algorithms on matrix manifolds , 2009, Math. Comput..
[100] Xuanjing Huang,et al. A Fast Accurate Two-stage Training Algorithm for L1-regularized CRFs with Heuristic Line Search Strategy , 2011, IJCNLP.
[101] Burr Settles,et al. Biomedical Named Entity Recognition using Conditional Random Fields and Rich Feature Sets , 2004, NLPBA/BioNLP.
[102] Tong Zhang,et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction , 2013, NIPS.
[103] Carl E. Rasmussen,et al. Assessing Approximate Inference for Binary Gaussian Process Classification , 2005, J. Mach. Learn. Res..
[104] Felix J. Herrmann,et al. Robust inversion, dimensionality reduction, and randomized sampling , 2012, Math. Program..
[105] Peter Richtárik,et al. Semi-Stochastic Gradient Descent Methods , 2013, Front. Appl. Math. Stat..
[106] Suvrit Sra,et al. Geometric Optimization in Machine Learning , 2016 .
[107] Wei Xu,et al. Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent , 2011, ArXiv.
[108] Shun-ichi Amari,et al. Natural Gradient Works Efficiently in Learning , 1998, Neural Computation.
[109] Alexander Shapiro,et al. Stochastic Approximation approach to Stochastic Programming , 2013 .
[110] Hanna M. Wallach,et al. Efficient Training of Conditional Random Fields , 2002 .
[111] Miguel Lázaro-Gredilla,et al. Doubly Stochastic Variational Bayes for non-Conjugate Inference , 2014, ICML.
[112] Sebastian Nowozin,et al. Structured Learning and Prediction in Computer Vision , 2011, Found. Trends Comput. Graph. Vis..
[113] Yoram Singer,et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..
[114] Charles Guyon,et al. Robust Principal Component Analysis for Background Subtraction: Systematic Evaluation and Comparative Analysis , 2012 .
[115] Mark W. Schmidt,et al. Let's Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence , 2017 .
[116] Julien Mairal,et al. Optimization with First-Order Surrogate Functions , 2013, ICML.
[117] Peter Carbonetto,et al. New probabilistic inference algorithms that harness the strengths of variational and Monte Carlo methods , 2009 .
[118] Yiming Yang,et al. RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..
[119] Antoine Bordes,et al. Guarantees for Approximate Incremental SVMs , 2010, AISTATS.
[120] Marc Teboulle,et al. Mirror descent and nonlinear projected subgradient methods for convex optimization , 2003, Oper. Res. Lett..
[121] Gordon V. Cormack,et al. Spam Corpus Creation for TREC , 2005, CEAS.