Distributed Machine Learning through Heterogeneous Edge Systems

Many emerging AI applications call for distributed machine learning (ML) across edge systems (e.g., IoT devices and PCs at the edge of the Internet), where data cannot be uploaded to a central venue for model training due to its large volume and/or security/privacy concerns. Edge devices are intrinsically heterogeneous in computing capacity, which poses significant challenges to parameter synchronization for parallel training under the parameter server (PS) architecture. This paper proposes ADSP, a parameter synchronization scheme for distributed ML over heterogeneous edge systems. To eliminate the significant waiting time incurred by existing parameter synchronization models, the core idea of ADSP is to let faster edge devices continue training while committing their model updates at strategically decided intervals. We design algorithms that decide the time points at which each worker commits its model update, ensuring not only global model convergence but also faster convergence. Our testbed implementation and experiments show that ADSP significantly outperforms existing parameter synchronization models in terms of ML model convergence time, scalability, and adaptability to large heterogeneity.
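The abstract describes workers that train at their own pace and push accumulated updates to the parameter server at per-worker commit intervals, rather than synchronizing in lockstep. The following is a minimal sketch of that commit-interval idea only; the class and parameter names (ParameterServer, Worker, commit_interval) are hypothetical, the workers run sequentially rather than concurrently, and the sketch does not reproduce ADSP's actual algorithm for choosing commit points.

import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)              # global model parameters

    def commit(self, update):
        self.params += update                    # apply a worker's accumulated update

    def pull(self):
        return self.params.copy()                # latest global model

class Worker:
    def __init__(self, ps, speed, commit_interval, dim, lr=0.01):
        self.ps = ps
        self.speed = speed                       # local steps per unit time (heterogeneous)
        self.commit_interval = commit_interval   # wall-clock time between commits
        self.local = ps.pull()                   # local copy of the model
        self.pending = np.zeros(dim)             # update accumulated since last commit
        self.lr = lr

    def run(self, duration):
        t = 0.0
        next_commit = self.commit_interval
        while t < duration:
            grad = np.random.randn(*self.local.shape)   # stand-in for a real gradient
            self.pending -= self.lr * grad
            self.local -= self.lr * grad
            t += 1.0 / self.speed                       # faster workers take more local steps
            if t >= next_commit:
                self.ps.commit(self.pending)            # commit accumulated update
                self.local = self.ps.pull()             # refresh from the global model
                self.pending[:] = 0.0
                next_commit += self.commit_interval

# Usage: two heterogeneous workers commit at the same wall-clock interval;
# the faster one simply performs more local steps between commits instead of waiting.
ps = ParameterServer(dim=10)
fast = Worker(ps, speed=4.0, commit_interval=2.0, dim=10)
slow = Worker(ps, speed=1.0, commit_interval=2.0, dim=10)
fast.run(duration=10.0)
slow.run(duration=10.0)
print(ps.params)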
