Adaptive Sampling Probabilities for Non-Smooth Optimization

Standard forms of coordinate and stochastic gradient methods do not adapt to structure in the data; their good behavior under random sampling is predicated on uniformity across the data. When gradients in certain blocks of features (for coordinate descent) or examples (for SGD) are larger than in others, there is natural structure that can be exploited for faster convergence. Yet adaptive variants often suffer from nontrivial computational overhead. We present a framework that discovers and leverages such structural properties at low computational cost. We employ a bandit optimization procedure that "learns" probabilities for sampling coordinates or examples in (non-smooth) optimization problems, allowing us to guarantee performance close to that of the optimal stationary sampling distribution. When such structure exists, our algorithms achieve tighter convergence guarantees than their non-adaptive counterparts, and we complement our analysis with experiments on several datasets.
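
To make the idea concrete, below is a minimal sketch, not the paper's exact algorithm, of SGD on a least-squares problem in which an EXP3-style multiplicative-weights update learns the example-sampling distribution: examples whose observed gradients have been large receive more probability mass, the gradient estimate is importance-weighted so it stays unbiased under the learned distribution, and mixing with the uniform distribution bounds the importance weights and preserves exploration. The objective, update rule, and all hyperparameters here are illustrative assumptions.

```python
import numpy as np


def adaptive_sampling_sgd(A, b, steps=2000, lr=0.01, bandit_lr=0.01, mix=0.5, seed=0):
    """SGD with an adaptively learned example-sampling distribution (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    weights = np.ones(n)  # unnormalized sampling weights, updated bandit-style

    for _ in range(steps):
        # Mix the learned distribution with uniform so every p_i >= mix / n,
        # which bounds the importance weights and keeps some exploration.
        p = (1.0 - mix) * weights / weights.sum() + mix / n
        i = rng.choice(n, p=p)

        # Subgradient of the i-th term 0.5 * (a_i^T x - b_i)^2.
        g_i = (A[i] @ x - b[i]) * A[i]

        # Importance weighting keeps the stochastic gradient unbiased under p.
        x -= lr * g_i / (n * p[i])

        # Bandit feedback: examples that produced large gradients get more mass,
        # corrected for how likely they were to be sampled in the first place.
        reward = np.linalg.norm(g_i) / p[i]
        weights[i] *= np.exp(bandit_lr * reward / n)
        weights /= weights.max()  # keep the unnormalized weights numerically bounded

    return x


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(200, 10))
    b = A @ rng.normal(size=10)
    x_hat = adaptive_sampling_sgd(A, b)
    print("residual norm:", np.linalg.norm(A @ x_hat - b))
```

The uniform mixing plays the role of exploration in the bandit procedure; without it, an example assigned vanishing probability would produce unbounded importance weights when sampled.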
