Pipeline Parallelism for Inference on Heterogeneous Edge Computing

Deep neural networks with large model sizes achieve state-of-the-art results for tasks in computer vision (CV) and natural language processing (NLP). However, these large-scale models are too compute- or memory-intensive for resource-constrained edge devices. Prior work on parallel and distributed execution primarily focuses on training, rather than inference, using homogeneous accelerators in data centers. We propose EdgePipe, a distributed framework for edge systems that uses pipeline parallelism both to speed up inference and to enable running larger (and more accurate) models that otherwise cannot fit on single edge devices. EdgePipe achieves these results with an optimal partition strategy that accounts for heterogeneity in compute, memory, and network bandwidth. Our empirical evaluation shows that EdgePipe achieves 10.59× and 11.88× speedups using 16 edge devices for the ViT-Large and ViT-Huge models, respectively, with no loss in accuracy. Similarly, with 16 edge devices, EdgePipe improves ViT-Huge throughput by 3.93× over a 4-node baseline; no single device can fit the model in memory on its own. Finally, we show up to a 4.16× throughput improvement over the state-of-the-art PipeDream when using a heterogeneous set of devices.
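
To make the heterogeneity-aware partitioning idea concrete, below is a minimal Python sketch of one plausible formulation, not EdgePipe's published algorithm: a dynamic program that splits a sequence of transformer layers into contiguous pipeline stages over a fixed device ordering, minimizing the bottleneck stage time (per-stage compute plus activation transfer) subject to each device's memory budget. All names and parameters (`partition`, `layer_time`, `dev_speed`, `bandwidth`, and so on) are illustrative assumptions.

```python
import math

def partition(layer_time, layer_mem, act_size, dev_speed, dev_mem, bandwidth):
    """Hypothetical heterogeneity-aware pipeline partitioner (illustration only).

    layer_time[i] : reference compute time of layer i (seconds on a speed-1.0 device)
    layer_mem[i]  : memory footprint of layer i (bytes)
    act_size[i]   : activation size emitted by layer i (bytes)
    dev_speed[d]  : relative compute speed of device d
    dev_mem[d]    : memory capacity of device d (bytes)
    bandwidth[d]  : outbound link bandwidth of device d (bytes/s)
    Returns (bottleneck_stage_time, stage_start_indices) or (inf, None) if infeasible.
    """
    L, D = len(layer_time), len(dev_speed)
    INF = math.inf
    # best[i][d]: minimal bottleneck time when layers 0..i-1 run on devices 0..d-1
    best = [[INF] * (D + 1) for _ in range(L + 1)]
    cut = [[None] * (D + 1) for _ in range(L + 1)]
    best[0][0] = 0.0

    for d in range(1, D + 1):              # device d-1 hosts the d-th stage
        for i in range(1, L + 1):
            for j in range(i):             # layers j..i-1 form that stage
                if best[j][d - 1] == INF:
                    continue
                if sum(layer_mem[j:i]) > dev_mem[d - 1]:
                    continue               # stage does not fit in device memory
                compute = sum(layer_time[j:i]) / dev_speed[d - 1]
                # last stage sends no activations downstream
                comm = act_size[i - 1] / bandwidth[d - 1] if i < L else 0.0
                bottleneck = max(best[j][d - 1], compute + comm)
                if bottleneck < best[i][d]:
                    best[i][d] = bottleneck
                    cut[i][d] = j

    d_best = min(range(1, D + 1), key=lambda d: best[L][d])
    if best[L][d_best] == INF:
        return INF, None                   # model cannot be placed at all
    # Backtrack the stage boundaries (start index of each stage, in order).
    starts, i, d = [], L, d_best
    while d > 0:
        starts.append(cut[i][d])
        i, d = cut[i][d], d - 1
    return best[L][d_best], starts[::-1]
```

The objective reflects why such a formulation is natural for pipelined inference: steady-state throughput is limited by the slowest stage, so minimizing the maximum per-stage compute-plus-communication time rewards giving faster, better-connected devices more layers while respecting each device's memory limit. EdgePipe's actual partitioner may use a different objective, search procedure, or device ordering.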
