论文信息 - Optimal Sparsity-Sensitive Bounds for Distributed Mean Estimation - 字舞流文

Optimal Sparsity-Sensitive Bounds for Distributed Mean Estimation

We consider the problem of estimating the mean of a set of vectors, which are stored in a distributed system. This is a fundamental task with applications in distributed SGD and many other distributed problems, where communication is a main bottleneck for scaling up computations. We propose a new sparsity-aware algorithm, which improves previous results both theoretically and empirically. The communication cost of our algorithm is characterized by Hoyer's measure of sparseness. Moreover, we prove that the communication cost of our algorithm is information-theoretic optimal up to a constant factor in all sparseness regime. We have also conducted experimental studies, which demonstrate the advantages of our method and confirm our theoretical findings.

Ke Yi | Ziyue Huang | Yilei Wang | zengfeng Huang

[1] Tengyu Ma,et al. On Communication Cost of Distributed Statistical Estimation and Dimensionality , 2014, NIPS.

[2] David P. Woodruff,et al. Communication lower bounds for statistical estimation problems via a distributed data processing inequality , 2015, STOC.

[3] Patrik O. Hoyer,et al. Non-negative Matrix Factorization with Sparseness Constraints , 2004, J. Mach. Learn. Res..

[4] Cong Xu,et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning , 2017, NIPS.

[5] Nam Sung Kim,et al. GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training , 2018, NeurIPS.

[6] Peter Elias,et al. Universal codeword sets and representations of the integers , 1975, IEEE Trans. Inf. Theory.

[7] Thomas M. Cover,et al. Elements of Information Theory , 2005 .

[8] Ji Liu,et al. Gradient Sparsification for Communication-Efficient Distributed Optimization , 2017, NeurIPS.

[9] Dimitris S. Papailiopoulos,et al. ATOMO: Communication-efficient Learning via Atomic Sparsification , 2018, NeurIPS.

[10] Nikko Strom,et al. Scalable distributed DNN training using commodity GPU cloud computing , 2015, INTERSPEECH.

[11] Trishul M. Chilimbi,et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System , 2014, OSDI.

[12] Tenkasi V. Ramabadran,et al. A coding scheme for m-out-of-n codes , 1990, IEEE Trans. Commun..

[13] Andrew Chi-Chih Yao,et al. Probabilistic computations: Toward a unified measure of complexity , 1977, 18th Annual Symposium on Foundations of Computer Science (sfcs 1977).

[14] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[15] Kenneth Heafield,et al. Sparse Communication for Distributed Gradient Descent , 2017, EMNLP.

[16] William J. Dally,et al. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training , 2017, ICLR.

[17] Philipp Birken,et al. Numerical Linear Algebra , 2011, Encyclopedia of Parallel Computing.

[18] Ananda Theertha Suresh,et al. Distributed Mean Estimation with Limited Communication , 2016, ICML.

[19] Thomas Hofmann,et al. Communication-Efficient Distributed Dual Coordinate Ascent , 2014, NIPS.

[20] Laurent Massoulié,et al. Optimal Algorithms for Smooth and Strongly Convex Distributed Optimization in Networks , 2017, ICML.

[21] Peter Richtárik,et al. Randomized Distributed Mean Estimation: Accuracy vs. Communication , 2016, Front. Appl. Math. Stat..

[22] Dan Alistarh,et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks , 2016, 1610.02132.

[23] Yi Zhou,et al. Communication-efficient algorithms for decentralized and stochastic optimization , 2017, Mathematical Programming.

[24] Stephen P. Boyd,et al. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers , 2011, Found. Trends Mach. Learn..

[25] Martin J. Wainwright,et al. Information-theoretic lower bounds for distributed statistical estimation with communication constraints , 2013, NIPS.

[26] Tianbao Yang,et al. Stochastic Variance Reduced Gradient Methods by Sampling Extra Data with Replacement , 2017 .

[27] S. P. Lloyd,et al. Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.