EAGLE: Expedited Device Placement with Automatic Grouping for Large Models

Large, advanced deep neural networks are usually trained on a mixture of devices, including multiple CPUs and GPUs, and training speed and efficiency depend heavily on how operations are placed on these devices. To find the optimal device placement, the state-of-the-art method uses reinforcement learning with a hierarchical model, which partitions the operations into groups and then assigns each group to a specific device. However, because grouping decisions add an extra dimension coupled with placement, the efficiency of reinforcement learning is greatly reduced; as modern neural networks grow in size and complexity, this low efficiency and high cost of device placement are further aggravated. In this paper, we propose EAGLE (Expedited Automatic Grouping for Large modEls), which integrates automatic grouping into reinforcement learning-based placement to achieve the best possible training time for very large models. An extra RNN transforms the parameters of the grouper into inputs of the placer, linking the originally separate parts together, and the network inputs are further optimized. We have deployed and extensively evaluated EAGLE on the Inception-V3, GNMT, and BERT benchmarks. Compared with the state-of-the-art, the performance achieved by our design, measured by the per-step time with the resulting placement, is 2.7% and 18.7% better for GNMT and BERT, respectively. For Inception-V3, our design is the fastest at discovering the optimal placement.
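To make the grouper-bridge-placer pipeline concrete, below is a minimal PyTorch sketch of the idea described above: a feed-forward grouper assigns operations to groups, an extra RNN turns the grouper's own parameters into a sequence of embeddings, and an RNN placer maps those embeddings to device assignments. All module names, tensor sizes, and the exact scheme for flattening the grouper's parameters are illustrative assumptions, not the paper's published implementation.

    # Hedged sketch of hierarchical placement with an extra "bridge" RNN.
    # Sizes and the parameter-flattening scheme are assumptions for illustration.
    import torch
    import torch.nn as nn

    NUM_OPS, OP_FEAT = 128, 32      # operations in the dataflow graph (assumed)
    NUM_GROUPS, NUM_DEVICES = 8, 4  # grouping / placement targets (assumed)
    HIDDEN = 64

    class Grouper(nn.Module):
        """Feed-forward grouper: a softmax over group ids per operation."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(OP_FEAT, HIDDEN), nn.ReLU(),
                                     nn.Linear(HIDDEN, NUM_GROUPS))
        def forward(self, op_feats):                # (NUM_OPS, OP_FEAT)
            return self.net(op_feats).softmax(-1)   # (NUM_OPS, NUM_GROUPS)

    class BridgeRNN(nn.Module):
        """Extra RNN linking grouper and placer: it reads chunks of the
        grouper's parameter vector and emits one embedding per group."""
        def __init__(self):
            super().__init__()
            self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        def forward(self, grouper):
            flat = torch.cat([p.detach().flatten() for p in grouper.parameters()])
            pad = (-flat.numel()) % HIDDEN          # pad so the vector chunks evenly
            flat = torch.cat([flat, flat.new_zeros(pad)])
            chunks = flat.view(1, -1, HIDDEN)       # (1, seq_len, HIDDEN)
            out, _ = self.rnn(chunks)
            return out[:, :NUM_GROUPS, :]           # one embedding per group (assumed)

    class Placer(nn.Module):
        """RNN placer: emits a device distribution for each group embedding."""
        def __init__(self):
            super().__init__()
            self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
            self.head = nn.Linear(HIDDEN, NUM_DEVICES)
        def forward(self, group_embs):
            out, _ = self.rnn(group_embs)
            return self.head(out).softmax(-1)       # (1, NUM_GROUPS, NUM_DEVICES)

    grouper, bridge, placer = Grouper(), BridgeRNN(), Placer()
    group_probs = grouper(torch.randn(NUM_OPS, OP_FEAT))
    placement_probs = placer(bridge(grouper))
    print(group_probs.shape, placement_probs.shape)

In a reinforcement learning loop, one would sample a grouping and a placement from these distributions, measure the per-step training time of the resulting placement as the reward, and update all three networks with a policy-gradient method; those training details are beyond this sketch.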
