SIEVE: Speculative Inference on the Edge with Versatile Exportation

This paper proposes SIEVE, Speculative Inference on the Edge with Versatile Exportation, which dynamically distributes CNN computation between the cloud and the edge device based on the input data and environmental conditions to maximize efficiency and performance. A speculative CNN is created through aggressive precision reduction so that most inferences run on the edge device, while the original CNN runs on the cloud server. A runtime system directs each input to either the edge or the cloud and decides whether to accept a speculative inference made on the edge or to invoke recovery by replaying the inference on the cloud. Compared to the cloud-only approach, SIEVE reduces energy consumption by an average of 91%, 57%, and 26% and improves performance by an average of 12.3×, 2.8×, and 2.0× for 3G, LTE, and WiFi connections, respectively, without accuracy loss across nine CNNs.

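To make the runtime decision concrete, the following is a minimal sketch of how a SIEVE-style speculate-then-recover loop could look. It assumes a simple confidence-threshold acceptance check and uses hypothetical function names (`run_edge_speculative`, `run_cloud_original`, `CONFIDENCE_THRESHOLD`); the actual acceptance criterion and offloading policy in the paper may differ.

```python
import numpy as np

# Sketch of a SIEVE-style runtime (hypothetical names and threshold).
# A reduced-precision "speculative" CNN runs on the edge device; its prediction
# is accepted only when a confidence check passes. Otherwise the input is
# replayed on the cloud server using the original full-precision CNN.

CONFIDENCE_THRESHOLD = 0.9  # assumed acceptance threshold, not from the paper


def run_edge_speculative(x):
    """Placeholder for the reduced-precision CNN running on the edge."""
    logits = np.random.rand(10)                      # stand-in for real edge inference
    probs = np.exp(logits) / np.exp(logits).sum()    # softmax over class scores
    return int(probs.argmax()), float(probs.max())   # predicted label, confidence


def run_cloud_original(x):
    """Placeholder for offloading the input to the full-precision cloud CNN."""
    return 0  # stand-in for the cloud's prediction


def classify(x, network_is_fast=False):
    # When the network is fast and cheap, direct offloading may be preferable.
    if network_is_fast:
        return run_cloud_original(x)

    # Otherwise speculate on the edge and accept only confident predictions.
    label, confidence = run_edge_speculative(x)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label                      # accept the speculative inference
    return run_cloud_original(x)          # recovery: replay on the cloud
```

The key design point this sketch illustrates is that the edge model never has to be correct on every input; it only needs to recognize when it is likely wrong, so that those inputs alone pay the latency and energy cost of the cloud round trip.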