Reducing Communication in Proximal Newton Methods for Sparse Least Squares Problems

Proximal Newton methods are iterative algorithms for solving l1-regularized least squares problems. Distributed-memory implementations of these methods have become popular because they enable the analysis of large-scale machine learning problems. However, their scalability is limited by communication overhead on modern distributed architectures. We propose a stochastic variance-reduced proximal method, combined with iteration overlapping and Hessian reuse, to find an efficient trade-off between computational complexity and data communication. The proposed RC-SFISTA algorithm reduces latency costs by a factor of k without altering bandwidth costs. RC-SFISTA is implemented in both MPI and Spark and compared to the state-of-the-art framework ProxCoCoA. The performance of RC-SFISTA is evaluated on 1 to 512 nodes for multiple benchmarks, demonstrating speedups of up to 12× over ProxCoCoA and scaling properties that outperform the original algorithm.
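The distributed RC-SFISTA algorithm itself is not reproduced in this abstract; for orientation only, the sketch below shows the underlying single-node FISTA iteration for the l1-regularized least squares objective min_x 0.5*||Ax - b||^2 + lambda*||x||_1, using soft-thresholding as the proximal step. The function names, step-size choice, and synthetic data are illustrative assumptions, not the paper's implementation, and no communication-reducing techniques (iteration overlapping, Hessian reuse) are included.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(A, b, lam, n_iter=200):
    """Minimize 0.5 * ||Ax - b||^2 + lam * ||x||_1 with FISTA.

    Uses a fixed step size 1/L, where L = ||A||_2^2 is the Lipschitz
    constant of the gradient of the smooth term.
    """
    n = A.shape[1]
    L = np.linalg.norm(A, 2) ** 2            # largest singular value squared
    x = np.zeros(n)
    y = x.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)              # gradient of the smooth term at y
        x_new = soft_threshold(y - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum extrapolation
        x, t = x_new, t_new
    return x

if __name__ == "__main__":
    # Hypothetical synthetic sparse recovery problem for demonstration.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 500))
    x_true = np.zeros(500)
    x_true[:10] = rng.standard_normal(10)
    b = A @ x_true + 0.01 * rng.standard_normal(200)
    x_hat = fista(A, b, lam=0.1)
    print("nonzeros recovered:", np.count_nonzero(np.abs(x_hat) > 1e-3))
```

In a distributed setting, the dominant costs of each such iteration are the matrix-vector products and the reductions they require, which is what motivates the communication-reducing reformulation described above.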
