论文信息 - Parameter Hub: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training

Parameter Hub: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training

Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and come up with better models. DNN training is often seen as a compute-bound problem, best done in a single large compute node with many GPUs. As DNNs get bigger, training requires going distributed. Distributed deep neural network (DDNN) training constitutes an important workload on the cloud. Larger DNN models and faster compute engines shift training performance bottleneck from computation to communication. Our experiments show existing DNN training frameworks do not scale in a typical cloud environment due to insufficient bandwidth and inefficient parameter server software stacks. We propose PHub, a high performance parameter server (PS) software design that provides an optimized network stack and a streamlined gradient processing pipeline to benefit common PS setups, and PBox, a balanced, scalable central PS hardware that fully utilizes PHub capabilities. We show that in a typical cloud environment, PHub can achieve up to 3.8x speedup over state-of-theart designs when training ImageNet. We discuss future directions of integrating PHub with programmable switches for in-network aggregation during training, leveraging the datacenter network topology to reduce bandwidth usage and localize data movement.

Amar Phanishayee | Jacob Nelson | Luis Ceze | Liang Luo | Arvind Krishnamurthy

[1] Alexander J. Smola,et al. Communication Efficient Distributed Machine Learning with the Parameter Server , 2014, NIPS.

[2] Tianqi Chen,et al. Training Deep Nets with Sublinear Memory Cost , 2016, ArXiv.

[3] Kaiming He,et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[4] Xiaozhou Li,et al. Be Fast, Cheap and in Control with SwitchKV , 2016, NSDI.

[5] Jorge Nocedal,et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , 2016, ICLR.

[6] Jacob Nelson,et al. IncBricks: Toward In-Network Computation with an In-Network Cache , 2017, ASPLOS.

[7] Qiang Wang,et al. Benchmarking State-of-the-Art Deep Learning Software Tools , 2016, 2016 7th International Conference on Cloud Computing and Big Data (CCBD).

[8] Alexander J. Smola,et al. An architecture for parallel topic models , 2010, Proc. VLDB Endow..

[9] Jialin Li,et al. Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering , 2016, OSDI.

[10] Alexander J. Smola,et al. Scaling Distributed Machine Learning with the Parameter Server , 2014, OSDI.

[11] Christopher Ré,et al. DimmWitted: A Study of Main-Memory Statistical Analytics , 2014, Proc. VLDB Endow..