Reducing Communication in Proximal Newton Methods for Sparse Least Squares Problems

Proximal Newton methods are iterative algorithms for solving l1-regularized least squares problems. Distributed-memory implementations of these methods have become popular because they enable the analysis of large-scale machine learning problems. However, their scalability is limited by communication overhead on modern distributed architectures. We propose a stochastic variance-reduced proximal method, combined with iteration overlapping and Hessian reuse, to find an efficient trade-off between computational complexity and data communication. The proposed RC-SFISTA algorithm reduces latency costs by a factor of k without altering bandwidth costs. RC-SFISTA is implemented in both MPI and Spark and compared to the state-of-the-art framework ProxCoCoA. The performance of RC-SFISTA is evaluated on 1 to 512 nodes for multiple benchmarks, demonstrating speedups of up to 12× over ProxCoCoA and scaling properties that outperform the original algorithm.
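The distributed RC-SFISTA algorithm itself is not reproduced in this abstract; for orientation only, the sketch below shows the underlying single-node FISTA iteration for the l1-regularized least squares objective min_x 0.5*||Ax - b||^2 + lambda*||x||_1, using soft-thresholding as the proximal step. The function names, step-size choice, and synthetic data are illustrative assumptions, not the paper's implementation, and no communication-reducing techniques (iteration overlapping, Hessian reuse) are included.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(A, b, lam, n_iter=200):
    """Minimize 0.5 * ||Ax - b||^2 + lam * ||x||_1 with FISTA.

    Uses a fixed step size 1/L, where L = ||A||_2^2 is the Lipschitz
    constant of the gradient of the smooth term.
    """
    n = A.shape[1]
    L = np.linalg.norm(A, 2) ** 2            # largest singular value squared
    x = np.zeros(n)
    y = x.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)              # gradient of the smooth term at y
        x_new = soft_threshold(y - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum extrapolation
        x, t = x_new, t_new
    return x

if __name__ == "__main__":
    # Hypothetical synthetic sparse recovery problem for demonstration.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 500))
    x_true = np.zeros(500)
    x_true[:10] = rng.standard_normal(10)
    b = A @ x_true + 0.01 * rng.standard_normal(200)
    x_hat = fista(A, b, lam=0.1)
    print("nonzeros recovered:", np.count_nonzero(np.abs(x_hat) > 1e-3))
```

In a distributed setting, the dominant costs of each such iteration are the matrix-vector products and the reductions they require, which is what motivates the communication-reducing reformulation described above.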
