Accelerated Stochastic Block Coordinate Descent with Optimal Sampling

We study the composite minimization problem in which the objective is the sum of two convex functions: the first is a finite sum of strongly convex and smooth functions, and the second is a general non-differentiable convex function. Specifically, we consider the case where the non-differentiable function is block separable and admits a simple proximal mapping for each block. Composite problems of this form arise in many data mining and machine learning applications and can be solved by block coordinate descent algorithms. We propose an accelerated stochastic block coordinate descent (ASBCD) algorithm, which incorporates an incrementally averaged partial derivative into the stochastic partial derivative and exploits optimal sampling. We prove that ASBCD attains a linear rate of convergence. We further show that, in contrast to uniform sampling, an optimal non-uniform sampling scheme can be employed to achieve a lower iteration complexity. Experimental results on several large-scale real-world data sets support our theory.
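To make the described update concrete, the following is a minimal, hypothetical Python sketch of one variance-reduced stochastic block coordinate step with non-uniform sampling, instantiated on a lasso-type problem (least-squares components plus an L1 block-separable term). The function name `asbcd_sketch`, the sampling rule p_i proportional to the per-component smoothness constant L_i = ||a_i||^2, and the step size eta = 1/(3 L_max) are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal mapping of tau * |.| applied to a single coordinate block.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def asbcd_sketch(A, b, lam, n_iters=10000, seed=0):
    """Sketch of a variance-reduced stochastic block coordinate step for
    min_x (1/n) sum_i f_i(x) + lam * ||x||_1 with f_i(x) = 0.5 * (a_i^T x - b_i)^2.
    Blocks are single coordinates; the sampling probabilities and step size
    below are assumed choices for illustration only."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)

    # Non-uniform sampling distribution over component functions (assumption:
    # proportional to per-example smoothness constants).
    L = np.sum(A * A, axis=1) + 1e-12
    p = L / L.sum()
    eta = 1.0 / (3.0 * L.max())          # conservative step size (assumption)

    # Incrementally averaged partial derivatives: stored gradient of each f_i
    # and their running average.
    grad_table = np.zeros((n, d))
    grad_avg = np.zeros(d)

    for _ in range(n_iters):
        i = rng.choice(n, p=p)           # importance-sampled component function
        j = rng.integers(d)              # sampled coordinate block

        g_i = (A[i] @ x - b[i]) * A[i]   # gradient of f_i at the current x

        # Unbiased variance-reduced estimate of the j-th partial derivative of f:
        # the 1/(n * p_i) factor compensates for non-uniform sampling.
        v_j = (g_i[j] - grad_table[i, j]) / (n * p[i]) + grad_avg[j]

        # Proximal (soft-thresholding) update on block j only.
        x[j] = soft_threshold(x[j] - eta * v_j, eta * lam)

        # Refresh the stored gradient of component i and the running average.
        grad_avg += (g_i - grad_table[i]) / n
        grad_table[i] = g_i

    return x
```

Sampling component functions in proportion to their smoothness constants makes the "harder" components appear more often, while the 1/(n p_i) correction keeps the partial-derivative estimate unbiased; this is one common way to realize the lower iteration complexity of non-uniform sampling that the abstract refers to.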
