Faster Coordinate Descent via Adaptive Importance Sampling

Coordinate descent methods employ random partial updates of decision variables to solve huge-scale convex optimization problems. In this work, we introduce new adaptive rules for the random selection of these updates. By adaptive, we mean that our selection rules are based on dual residuals or primal-dual gap estimates and can change at each iteration. We theoretically characterize the performance of our selection rules, demonstrate improvements over the state of the art, and extend our theory and algorithms to general convex objectives. Numerical evidence with hinge-loss support vector machines and Lasso confirms that practice follows the theory.
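
To make the adaptive selection idea concrete, the sketch below runs coordinate descent on a Lasso problem and samples coordinates with probabilities proportional to a per-coordinate optimality-violation score. This score is an illustrative stand-in for the dual-residual and duality-gap based rules analyzed in the paper, not their exact form; the function names (`adaptive_cd_lasso`, `soft_threshold`) and the specific weighting are assumptions made for illustration only.

```python
# A minimal sketch of coordinate descent with adaptive (residual-based)
# coordinate selection, illustrated on the Lasso problem
#     min_x 0.5 * ||A x - b||^2 + lam * ||x||_1 .
# The selection rule below is an illustrative proxy for the dual-residual /
# duality-gap based rules in the paper, not the paper's exact algorithm.
import numpy as np


def soft_threshold(z, t):
    """Proximal operator of t * |.|_1, applied element-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def adaptive_cd_lasso(A, b, lam, iters=5000, rng=None):
    rng = np.random.default_rng(rng)
    n, d = A.shape
    L = (A ** 2).sum(axis=0)          # per-coordinate Lipschitz constants
    x = np.zeros(d)
    residual = A @ x - b              # maintained incrementally below

    for _ in range(iters):
        grad = A.T @ residual                         # gradient of the smooth part
        # Per-coordinate optimality violation (zero iff coordinate i is optimal),
        # used here as an adaptive importance score.
        score = np.abs(x - soft_threshold(x - grad / L, lam / L)) * L
        if score.sum() <= 1e-12:                      # all coordinates optimal
            break
        p = score / score.sum()                       # adaptive sampling distribution
        i = rng.choice(d, p=p)

        # Exact coordinate minimization along coordinate i.
        x_old = x[i]
        x[i] = soft_threshold(x_old - grad[i] / L[i], lam / L[i])
        residual += A[:, i] * (x[i] - x_old)          # keep residual consistent

    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 50))
    x_true = rng.standard_normal(50) * (rng.random(50) < 0.2)
    b = A @ x_true + 0.01 * rng.standard_normal(200)
    x_hat = adaptive_cd_lasso(A, b, lam=0.1)
    print("recovered nonzeros:", np.count_nonzero(np.abs(x_hat) > 1e-6))
```

Recomputing the full gradient each iteration keeps the sketch short but costs O(nd) per step; an efficient implementation would maintain the per-coordinate scores incrementally, in the spirit of the gap-based bookkeeping the paper relies on.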
