Avoiding Synchronization in First-Order Methods for Sparse Convex Optimization

Parallel computing has played an important role in speeding up convex optimization methods for big data analytics and large-scale machine learning (ML). However, the scalability of these optimization methods is inhibited by the cost of communicating and synchronizing processors in a parallel setting. Iterative ML methods are particularly sensitive to communication cost, since they often require communication at every iteration. In this work, we extend well-known techniques from Communication-Avoiding Krylov subspace methods to first-order, block coordinate descent methods for Support Vector Machines and proximal least-squares problems. Our Synchronization-Avoiding (SA) variants reduce the latency cost by a tunable factor of s at the expense of a factor-of-s increase in flops and bandwidth costs. We show that the SA variants are numerically stable and attain speedups of up to 5.1x on a Cray XC30 supercomputer.
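To make the latency/flops trade-off concrete, the sketch below illustrates the s-step idea behind the SA variants on a small instance: ridge regression solved by block coordinate descent, where the Gram and residual products for s coordinate blocks are formed in one batched round (which would be a single all-reduce in a distributed run) and the s block updates then proceed with no further communication, recovering the intermediate residual products from the batched Gram matrix. This is a minimal serial sketch under stated assumptions, not the authors' implementation; the function name sa_bcd_ridge and all parameter choices are illustrative.

```python
import numpy as np

def sa_bcd_ridge(A, b, lam=1e-2, block_size=4, s=8, outer_steps=50, seed=0):
    """s-step block coordinate descent (illustrative sketch) for
         min_x 0.5*||A x - b||^2 + 0.5*lam*||x||^2.
    Each outer step batches the communication for s inner BCD updates:
    one round computes G = Y^T Y and w = Y^T r for all s chosen blocks,
    instead of one round per block update."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    r = b - A @ x                       # residual b - A x, kept up to date
    n_blocks = n // block_size
    assert s <= n_blocks
    for _ in range(outer_steps):
        # Choose s distinct coordinate blocks up front so their Gram and
        # residual products can be formed in ONE batched round.
        blocks = rng.choice(n_blocks, size=s, replace=False)
        cols = np.concatenate([np.arange(j * block_size, (j + 1) * block_size)
                               for j in blocks])
        Y = A[:, cols]                  # m x (s * block_size) column panel
        G = Y.T @ Y                     # batched Gram matrix (one all-reduce)
        w = Y.T @ r                     # batched residual products (same round)
        dx = np.zeros(s * block_size)   # accumulated update over this round
        for k in range(s):              # s updates, no further communication
            sl = slice(k * block_size, (k + 1) * block_size)
            # Current A_Ik^T r is recovered locally via the Gram recurrence:
            # Y_k^T (r - Y dx) = w[sl] - G[sl, :] @ dx.
            g = w[sl] - G[sl, :] @ dx
            H = G[sl, sl] + lam * np.eye(block_size)
            step = np.linalg.solve(H, g - lam * x[cols[sl]])
            x[cols[sl]] += step
            dx[sl] += step
        r -= Y @ dx                     # single residual update per round
    return x

if __name__ == "__main__":
    # Quick sanity check on a synthetic problem with known solution.
    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 64))
    b = A @ np.ones(64)
    x = sa_bcd_ridge(A, b)
    print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

In this sketch, increasing s shrinks the number of communication rounds per s updates from s to 1, while the batched Gram matrix grows to (s * block_size)^2 entries, mirroring the abstract's tunable latency-versus-flops/bandwidth trade-off.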
