Stochastic Dual Coordinate Descent with Bandit Sampling

Coordinate descent methods minimize a cost function by updating a single decision variable (corresponding to one coordinate) at a time. Ideally, one would update the decision variable that yields the largest marginal decrease in the cost function; finding it, however, would require evaluating all coordinates, which is computationally impractical. We instead propose a new adaptive method for coordinate descent: we define a lower bound on the decrease of the cost function when a coordinate is updated and, rather than computing this lower bound for all coordinates, we use a multi-armed bandit algorithm to learn which coordinates yield the largest marginal decrease while simultaneously performing coordinate descent. We show, both theoretically and experimentally, that our approach improves the convergence of coordinate descent methods, including parallel versions. A minimal sketch of this idea appears below.
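To make the idea concrete, here is a minimal Python sketch, not the paper's algorithm: it runs coordinate descent on an assumed strongly convex quadratic objective and uses an EXP3-style bandit (one arm per coordinate) to bias sampling toward coordinates whose updates recently produced large decreases. The function name bandit_coordinate_descent, the step parameters eta and gamma, and the use of the realized decrease as the bandit reward (in place of the paper's lower bound) are all illustrative assumptions.

```python
import numpy as np

def bandit_coordinate_descent(A, b, n_iters=2000, eta=0.1, gamma=0.1, seed=0):
    """Minimize f(x) = 0.5 * x^T A x - b^T x by coordinate descent,
    choosing coordinates with an EXP3-style bandit (one arm per coordinate).
    Illustrative sketch only; rewards here are realized decreases, not the
    paper's lower bound."""
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    weights = np.ones(n)  # bandit weights, one per coordinate

    def f(x):
        return 0.5 * x @ A @ x - b @ x

    for _ in range(n_iters):
        # Mix exploitation (learned weights) with uniform exploration.
        probs = (1 - gamma) * weights / weights.sum() + gamma / n
        i = rng.choice(n, p=probs)

        # Exact coordinate minimization for the quadratic: solve df/dx_i = 0.
        grad_i = A[i] @ x - b[i]
        old_f = f(x)
        x[i] -= grad_i / A[i, i]
        decrease = old_f - f(x)  # realized marginal decrease, used as reward

        # EXP3 update, importance-weighted by the sampling probability.
        weights[i] *= np.exp(eta * decrease / probs[i])
        weights /= weights.max()  # rescale to avoid numeric overflow

    return x

# Usage: a random strongly convex quadratic.
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)  # symmetric positive definite
b = rng.standard_normal(50)
x_hat = bandit_coordinate_descent(A, b)
print("residual:", np.linalg.norm(A @ x_hat - b))
```

The EXP3-style update is a natural fit here because the per-coordinate rewards are non-stationary: as the iterate approaches the optimum, the marginal decrease available from each coordinate shrinks, so the bandit must keep adapting rather than converge to a fixed arm.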
