Parallax: Automatic Data-Parallel Training of Deep Neural Networks

The use of high-performance servers and GPU accelerators for training deep neural network models has greatly accelerated recent advances in machine learning (ML). ML frameworks such as TensorFlow, MXNet, and Caffe2 have emerged to help ML researchers train their models in a distributed fashion. However, correctly and efficiently utilizing multiple machines and GPUs is still not a straightforward task for framework users, because distribution introduces non-trivial correctness and performance challenges. This paper introduces Parallax, a tool for automatic parallelization of deep learning training in distributed environments. Parallax not only handles the subtle correctness issues but also applies various optimizations to minimize the communication overhead incurred by scaling out. Experiments show that Parallax, built atop TensorFlow, achieves scalable training throughput on multiple CNN and RNN models while requiring little effort from its users.
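To make the data-parallel pattern that Parallax automates concrete, the sketch below shows synchronous data-parallel training in TensorFlow using the framework's built-in `tf.distribute.MirroredStrategy`. This is only an illustration of the general technique (replicate the model per GPU, split the batch, all-reduce the gradients), not Parallax's own API, which instead transforms an unmodified single-device program automatically; the model, dataset, and hyperparameters here are arbitrary choices for the example.

```python
# Minimal sketch of synchronous data-parallel training in TensorFlow.
# Illustrative only: uses tf.distribute.MirroredStrategy rather than
# Parallax's automatic graph transformation.
import tensorflow as tf

# Each local GPU holds a full model replica; gradients are averaged
# across replicas with an all-reduce before every variable update.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# The global batch is split across replicas; each replica computes
# gradients on its shard, and the averaged update is applied in sync.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
model.fit(x_train, y_train, batch_size=256, epochs=1)
```

The point of Parallax is that even this small amount of distribution-aware code (choosing a strategy, scoping variable creation, sizing the global batch) is written and tuned for the user: the tool takes a single-GPU program and handles replication, gradient aggregation, and communication optimization on its own.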
