Avoiding Synchronization in First-Order Methods for Sparse Convex Optimization

Parallel computing has played an important role in speeding up convex optimization methods for big data analytics and large-scale machine learning (ML). However, the scalability of these optimization methods is inhibited by the cost of communicating and synchronizing processors in a parallel setting. Iterative ML methods are particularly sensitive to communication cost, since they often require communication at every iteration. In this work, we extend well-known techniques from Communication-Avoiding Krylov subspace methods to first-order, block coordinate descent methods for Support Vector Machines and proximal least-squares problems. Our Synchronization-Avoiding (SA) variants reduce the latency cost by a tunable factor of s at the expense of a factor-of-s increase in flops and bandwidth costs. We show that the SA variants are numerically stable and attain speedups of up to 5.1x on a Cray XC30 supercomputer.
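To make the latency/flops trade-off concrete, the sketch below illustrates the s-step idea behind the SA variants on a small instance: ridge regression solved by block coordinate descent, where the Gram and residual products for s coordinate blocks are formed in one batched round (which would be a single all-reduce in a distributed run) and the s block updates then proceed with no further communication, recovering the intermediate residual products from the batched Gram matrix. This is a minimal serial sketch under stated assumptions, not the authors' implementation; the function name sa_bcd_ridge and all parameter choices are illustrative.

```python
import numpy as np

def sa_bcd_ridge(A, b, lam=1e-2, block_size=4, s=8, outer_steps=50, seed=0):
    """s-step block coordinate descent (illustrative sketch) for
         min_x 0.5*||A x - b||^2 + 0.5*lam*||x||^2.
    Each outer step batches the communication for s inner BCD updates:
    one round computes G = Y^T Y and w = Y^T r for all s chosen blocks,
    instead of one round per block update."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    r = b - A @ x                       # residual b - A x, kept up to date
    n_blocks = n // block_size
    assert s <= n_blocks
    for _ in range(outer_steps):
        # Choose s distinct coordinate blocks up front so their Gram and
        # residual products can be formed in ONE batched round.
        blocks = rng.choice(n_blocks, size=s, replace=False)
        cols = np.concatenate([np.arange(j * block_size, (j + 1) * block_size)
                               for j in blocks])
        Y = A[:, cols]                  # m x (s * block_size) column panel
        G = Y.T @ Y                     # batched Gram matrix (one all-reduce)
        w = Y.T @ r                     # batched residual products (same round)
        dx = np.zeros(s * block_size)   # accumulated update over this round
        for k in range(s):              # s updates, no further communication
            sl = slice(k * block_size, (k + 1) * block_size)
            # Current A_Ik^T r is recovered locally via the Gram recurrence:
            # Y_k^T (r - Y dx) = w[sl] - G[sl, :] @ dx.
            g = w[sl] - G[sl, :] @ dx
            H = G[sl, sl] + lam * np.eye(block_size)
            step = np.linalg.solve(H, g - lam * x[cols[sl]])
            x[cols[sl]] += step
            dx[sl] += step
        r -= Y @ dx                     # single residual update per round
    return x

if __name__ == "__main__":
    # Quick sanity check on a synthetic problem with known solution.
    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 64))
    b = A @ np.ones(64)
    x = sa_bcd_ridge(A, b)
    print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

In this sketch, increasing s shrinks the number of communication rounds per s updates from s to 1, while the batched Gram matrix grows to (s * block_size)^2 entries, mirroring the abstract's tunable latency-versus-flops/bandwidth trade-off.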
