A Single-Shot Generalized Device Placement for Large Dataflow Graphs

With increasingly complex neural network architectures and heterogeneous device characteristics, finding a reasonable graph partitioning and device placement strategy is challenging. There have been prior attempts at learned approaches to device placement, but these approaches are computationally expensive, unable to handle large graphs with over 50,000 nodes, and do not generalize well to unseen graphs. To address these limitations, we propose an efficient single-shot, generalized deep RL method (SGDP) based on a scalable sequential attention mechanism over a graph neural network that is transferable to new graphs. On a diverse set of representative deep learning models, our method achieves on average a 20% improvement over human placement and an 18% improvement over the prior art, with 15× faster convergence. We are the first to demonstrate superhuman performance on an 8-layer recurrent neural network language model and an 8-layer GNMT model, each consisting of over 50,000 nodes, on 8 GPUs. We also provide rationales and a sensitivity study on model architecture selections.

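The abstract describes a placement policy in which a graph neural network embeds the dataflow graph and an attention-based head assigns each operation to a device. The minimal NumPy sketch below illustrates only that general shape; the mean-aggregation GNN, the per-device query scoring, the layer sizes, and all function names are illustrative assumptions and not the authors' implementation (which uses segment-level recurrent attention and is trained end to end with reinforcement learning).

```python
# Minimal sketch (illustrative assumptions, not the SGDP implementation):
# a GNN embeds the dataflow graph, then each node is scored against
# per-device query vectors to produce a device assignment.
import numpy as np

rng = np.random.default_rng(0)

def gnn_embed(node_feats, adj, num_layers=2):
    """Mean-aggregation message passing (GraphSAGE-style), untrained and illustrative."""
    h = node_feats
    for _ in range(num_layers):
        deg = adj.sum(axis=1, keepdims=True) + 1e-6
        neigh = adj @ h / deg                          # average neighbor embeddings
        W = rng.standard_normal((2 * h.shape[1], h.shape[1])) * 0.1
        h = np.tanh(np.concatenate([h, neigh], axis=1) @ W)
    return h

def place(node_embeds, num_devices=8):
    """Score each node against per-device queries and pick a device greedily."""
    d = node_embeds.shape[1]
    device_queries = rng.standard_normal((num_devices, d)) * 0.1
    logits = node_embeds @ device_queries.T            # [num_nodes, num_devices]
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1)                        # one device id per op

# Toy chain-structured dataflow graph: 6 ops with 16-dim op features.
adj = np.eye(6, k=1) + np.eye(6, k=-1)
feats = rng.standard_normal((6, 16))
print(place(gnn_embed(feats, adj), num_devices=4))
```

In the actual method the scoring head would be trained (e.g., with a policy-gradient objective against measured step time) rather than using random weights; the sketch only shows how graph embeddings and attention-style device scoring fit together.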