DLion: Decentralized Distributed Deep Learning in Micro-Clouds

Deep learning is a popular technique for building models from large quantities of input data for applications in many domains. With the proliferation of edge devices such as sensors and mobile devices, large volumes of data are generated at a rapid pace all over the world. Migrating large amounts of data into centralized data centers over WAN environments is often infeasible for cost, performance, or privacy reasons. Moreover, there is an increasing need for incremental or online deep learning over newly generated data in real time. These trends require rethinking the traditional training approach to deep learning. To handle computation on distributed input data, micro-clouds (small-scale clouds deployed near edge devices in many different locations) provide an attractive alternative for data locality reasons. However, existing distributed deep learning systems do not support training in micro-clouds, due to the unique characteristics and challenges of this environment. In this paper, we examine the key challenges of deep learning in micro-clouds: computation and network resource heterogeneity at the inter- and intra-micro-cloud levels, and their scale. We present DLion, a decentralized distributed deep learning system for such environments. It employs techniques specifically designed to address these challenges in order to reduce training time, enhance model accuracy, and provide system scalability. We have implemented a prototype of DLion in TensorFlow, and our preliminary experiments show promising results towards accurate and efficient distributed deep learning in micro-clouds.
