Auto-Split: A General Framework of Collaborative Edge-Cloud AI

In many industrial-scale applications, large, resource-intensive machine learning models reside on powerful cloud servers, while large volumes of input data are collected at the edge of the cloud, which typically consists of many low-power devices. Inference results must also be delivered to users, or passed to downstream tasks, at the edge. Designing industry products that support the deployment of sophisticated deep models and carry out inference efficiently, so that model accuracy stays high while end-to-end latency stays low, is a major challenge. This paper describes the techniques and engineering practice behind Auto-Split, an edge-cloud collaborative prototype from Huawei Cloud. This patented technology has already been validated on selected applications, is on its way to broader, systematic edge-cloud application integration, and is being made available for public use as an automated pipeline service for end-to-end cloud-edge collaborative intelligence deployment. To the best of our knowledge, no existing industry product provides the capability of Deep Neural Network (DNN) splitting.
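To make the idea of DNN splitting concrete, the sketch below shows latency-driven split-point selection in Python. It is a minimal illustration only, not Auto-Split's actual optimizer: the per-layer latency profile, activation sizes, uplink bandwidth, and all numbers are hypothetical assumptions, and a real system would profile these on the target edge device, network, and cloud hardware.

    from dataclasses import dataclass

    @dataclass
    class Layer:
        name: str
        edge_ms: float   # assumed on-device compute time (hypothetical profile)
        cloud_ms: float  # assumed cloud compute time (hypothetical profile)
        out_bytes: int   # size of this layer's output activation

    def end_to_end_ms(layers, split, uplink_bps, input_bytes):
        # layers[:split] run on the edge device, layers[split:] in the cloud.
        # split == 0 is cloud-only (ship the raw input);
        # split == len(layers) is edge-only (nothing is transmitted).
        edge = sum(l.edge_ms for l in layers[:split])
        cloud = sum(l.cloud_ms for l in layers[split:])
        payload = input_bytes if split == 0 else layers[split - 1].out_bytes
        transmit = 0.0 if split == len(layers) else payload * 8 / uplink_bps * 1000.0
        return edge + transmit + cloud

    def best_split(layers, uplink_bps, input_bytes):
        # Exhaustively score every cut point and keep the cheapest one.
        return min(range(len(layers) + 1),
                   key=lambda s: end_to_end_ms(layers, s, uplink_bps, input_bytes))

    # Toy profile: a narrow bottleneck after conv3 makes an intermediate split
    # cheaper than either pure-edge or pure-cloud execution.
    net = [
        Layer("conv1", edge_ms=5.0,   cloud_ms=0.5, out_bytes=800_000),
        Layer("conv2", edge_ms=8.0,   cloud_ms=0.7, out_bytes=400_000),
        Layer("conv3", edge_ms=12.0,  cloud_ms=0.9, out_bytes=50_000),
        Layer("fc",    edge_ms=400.0, cloud_ms=1.2, out_bytes=4_000),
    ]
    s = best_split(net, uplink_bps=5_000_000, input_bytes=1_600_000)
    print("split after", "input" if s == 0 else net[s - 1].name,
          f"-> {end_to_end_ms(net, s, 5_000_000, 1_600_000):.1f} ms")

With these made-up numbers, the search places the split after conv3 (about 106 ms end to end), beating both cloud-only (about 2.6 s, dominated by shipping the raw input) and edge-only (about 425 ms, dominated by running the large final layer on a low-power device). The exhaustive search is linear in network depth; the best cut tends to sit at a narrow intermediate activation, where transmitting the feature map is far cheaper than transmitting the input.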
