FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Framework for Heterogeneous Edge Devices

With the increasing penetration and proliferation of Internet of Things (IoT) devices, there is a growing trend towards distributing the power of deep learning (DL) across edge devices rather than centralizing it in the cloud. This development enables better privacy preservation, real-time responses, and user-specific models. To deploy deep and complex models on edge devices with limited resources, partitioning of the deep neural network (DNN) model is necessary and has been widely studied. However, most of the existing literature only considers distributing the inference model, while still relying on centralized cloud infrastructure to generate this model through training. In this paper, we propose FTPipeHD, a novel DNN training framework that trains DNN models across distributed heterogeneous devices with a fault-tolerance mechanism. To accelerate training under the time-varying computing power of each device, we optimize the partition points dynamically according to real-time computing capacities. We also propose a novel weight redistribution approach that periodically replicates the weights to both the neighboring nodes and the central node, which combats the failure of multiple devices during training while incurring limited communication cost. Our numerical results demonstrate that FTPipeHD trains 6.8x faster than the state-of-the-art method when the computing capacity of the best device is 10x greater than that of the worst one. It is also shown that the proposed method accelerates training even in the presence of device failures.
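The two mechanisms described above lend themselves to a brief illustration. The Python sketch below is not the FTPipeHD implementation; `compute_partition_points`, `maybe_replicate_weights`, and the `neighbor`/`central` stubs are hypothetical names introduced here. It shows (a) a greedy heuristic that places pipeline partition points so that each device's share of the total layer cost is roughly proportional to its measured computing capacity, and (b) periodic replication of a stage's weights to the neighboring node and the central node so a recent copy survives a device failure.

```python
import copy
from typing import List

def compute_partition_points(layer_costs: List[float],
                             capacities: List[float]) -> List[int]:
    """Greedily choose stage boundaries so that each device's share of the
    total layer cost is roughly proportional to its measured capacity.
    (An illustrative heuristic, not the paper's algorithm.)"""
    total_cost = sum(layer_costs)
    total_cap = sum(capacities)
    # Target cumulative cost at the end of each of the first n-1 stages.
    targets, acc = [], 0.0
    for cap in capacities[:-1]:
        acc += total_cost * cap / total_cap
        targets.append(acc)
    points, cum, k = [], 0.0, 0
    for i, cost in enumerate(layer_costs):
        cum += cost
        while k < len(targets) and cum >= targets[k]:
            points.append(i + 1)  # cut the model after layer i
            k += 1
    return points

def maybe_replicate_weights(stage_state: dict, neighbor, central,
                            step: int, period: int = 100) -> None:
    """Every `period` training steps, push a snapshot of this stage's
    weights to the next node in the pipeline and to the central node."""
    if step % period != 0:
        return
    snapshot = copy.deepcopy(stage_state)
    neighbor.store_backup(snapshot)   # hypothetical RPC stub
    central.store_backup(snapshot)    # hypothetical RPC stub

# Example: 8 equal-cost layers over 3 devices whose capacities differ 1:2:5.
print(compute_partition_points([1.0] * 8, [1.0, 2.0, 5.0]))  # -> [1, 3]
```

Under these assumptions, repartitioning amounts to re-running `compute_partition_points` whenever the measured capacities drift, and the replication period trades communication cost against how much progress is lost on a failure.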
