Parallel Restarted SPIDER - Communication Efficient Distributed Nonconvex Optimization with Optimal Computation Complexity

In this paper, we propose a distributed algorithm for stochastic, smooth, non-convex optimization. We assume a worker-server architecture in which $N$ worker nodes, each holding $n$ (potentially infinite) samples, collaborate through a central server to perform the optimization task. The global objective is to minimize the average of the local cost functions available at the individual nodes. The proposed approach is a non-trivial extension of the popular parallel-restarted SGD algorithm that incorporates the variance-reduction-based SPIDER gradient estimator. We prove convergence of the algorithm to a first-order stationary solution. The proposed approach achieves the best known communication complexity $O(\epsilon^{-1})$ together with the optimal computation complexity. For finite-sum problems (finite $n$), we achieve the optimal computation (IFO) complexity $O(\sqrt{Nn}\epsilon^{-1})$. For online problems ($n$ unknown or infinite), we achieve the optimal IFO complexity $O(\epsilon^{-3/2})$. In both cases, we maintain the linear speedup achieved by existing methods. These rates are a substantial improvement over the $O(\epsilon^{-2})$ IFO complexity of existing approaches. Additionally, our algorithm is general enough to allow non-identical data distributions across workers, as in the recently proposed federated learning paradigm.
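The abstract describes an algorithm that combines two well-known ingredients: parallel-restarted (local) SGD, where workers run local updates and a server periodically averages their models, and the SPIDER recursive gradient estimator, $v_t = \nabla f_{\mathcal{S}}(x_t) - \nabla f_{\mathcal{S}}(x_{t-1}) + v_{t-1}$, which is reset with a large-batch gradient at the start of each epoch. The sketch below is a minimal NumPy illustration of that general structure on the objective $f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$ with non-identical local data; it is not the authors' reference implementation, and the toy quadratic losses, the full-gradient reset, and the parameter values (`eta`, `q`, `b`, `T`) are assumptions made purely for the example.

```python
# Illustrative sketch only: local SPIDER-style updates per worker with periodic
# model averaging ("restarts") by the server. All problem data and hyperparameters
# below are placeholders, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
N, n, d = 4, 256, 10              # workers, samples per worker, dimension
eta, q, b, T = 0.1, 16, 8, 200    # step size, epoch length, mini-batch size, iterations

# Synthetic, non-identically distributed local data (federated-style):
# worker i holds samples a_{ij} drawn around mean i.
data = [rng.normal(loc=i, scale=1.0, size=(n, d)) for i in range(N)]

def stoch_grad(i, x, idx):
    """Mini-batch gradient of f_i(x) = (1/n) * sum_j 0.5 * ||x - a_{ij}||^2 over samples idx."""
    return x - data[i][idx].mean(axis=0)

x = np.zeros(d)                   # server (averaged) model
for t in range(T):
    if t % q == 0:
        # Restart: each worker re-synchronizes to the averaged model and resets its
        # SPIDER estimator with a large-batch (here, full) local gradient.
        x_local = [x.copy() for _ in range(N)]
        v = [stoch_grad(i, x, np.arange(n)) for i in range(N)]
    for i in range(N):
        idx = rng.choice(n, size=b, replace=False)
        x_prev = x_local[i].copy()
        x_local[i] = x_local[i] - eta * v[i]
        # SPIDER recursion on the same mini-batch: v <- grad(x_new) - grad(x_old) + v.
        v[i] = stoch_grad(i, x_local[i], idx) - stoch_grad(i, x_prev, idx) + v[i]
    if (t + 1) % q == 0:
        # Communication round: the server averages the local models.
        x = np.mean(x_local, axis=0)

# For this toy objective the gradient of f at x is x minus the global data mean.
global_mean = np.mean([a.mean(axis=0) for a in data], axis=0)
print("final gradient norm:", np.linalg.norm(x - global_mean))
```

Note how communication happens only once every `q` local steps, which is the mechanism behind the $O(\epsilon^{-1})$ communication complexity claimed above, while the variance-reduced estimator is what lowers the IFO cost relative to plain parallel-restarted SGD.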
