Accelerate Model Parallel Training by Using Efficient Graph Traversal Order in Device Placement

Modern neural networks require long training times to reach good performance on massive datasets. A common approach to speeding up training is model parallelization, in which a large neural network is split across multiple devices. However, different device placements of the same neural network lead to different training times. Most existing device placement solutions treat the problem as sequential decision-making: they traverse the neural network's computation graph and assign its nodes to devices one at a time. This work studies the impact of graph traversal order on device placement. In particular, we empirically study how different graph traversal orders lead to different device placements, which in turn affect the training execution time. Our experimental results show that the best graph traversal order depends on the type of neural network and the features of its computation graph. We also provide recommendations on choosing the graph traversal order in device placement for various neural network families to improve training time under model parallelization.
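To make the core idea concrete, the sketch below shows how traversal order alone can change a placement. It builds a toy computation graph (a hypothetical stand-in for a real neural network's dataflow graph, not one from the paper's experiments), traverses it in two different orders, and applies the same simple round-robin assignment (a stand-in for a learned placement policy) to each order. The node names, the round-robin rule, and the two-device setting are all illustrative assumptions.

```python
from collections import deque

# Toy computation graph (DAG): node -> list of successors.
# Node names are hypothetical; any DAG would do.
graph = {
    "input": ["conv1", "conv2"],
    "conv1": ["concat"],
    "conv2": ["concat"],
    "concat": ["fc"],
    "fc": [],
}

def topological_order(g):
    """Kahn's algorithm: repeatedly emit nodes with no remaining predecessors."""
    indegree = {n: 0 for n in g}
    for n in g:
        for m in g[n]:
            indegree[m] += 1
    queue = deque(n for n in g if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in g[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    return order

def dfs_order(g, root):
    """Preorder depth-first traversal from a single root."""
    order, stack, seen = [], [root], set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        order.append(n)
        stack.extend(reversed(g[n]))
    return order

def round_robin_placement(order, num_devices=2):
    """Assign nodes to devices in traversal order (stand-in for a learned policy)."""
    return {n: i % num_devices for i, n in enumerate(order)}

topo = topological_order(graph)          # ['input', 'conv1', 'conv2', 'concat', 'fc']
dfs = dfs_order(graph, "input")          # ['input', 'conv1', 'concat', 'fc', 'conv2']
print(round_robin_placement(topo))
print(round_robin_placement(dfs))
```

Even with an identical assignment rule, the two traversal orders place `concat` on different devices, which would change the cross-device communication pattern and hence the training step time — the effect the paper measures at scale.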
