Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge

The computation for today's intelligent personal assistants such as Apple Siri, Google Now, and Microsoft Cortana, is performed in the cloud. This cloud-only approach requires significant amounts of data to be sent to the cloud over the wireless network and puts significant computational pressure on the datacenter. However, as the computational resources in mobile devices become more powerful and energy efficient, questions arise as to whether this cloud-only processing is desirable moving forward, and what are the implications of pushing some or all of this compute to the mobile devices on the edge. In this paper, we examine the status quo approach of cloud-only processing and investigate computation partitioning strategies that effectively leverage both the cycles in the cloud and on the mobile device to achieve low latency, low energy consumption, and high datacenter throughput for this class of intelligent applications. Our study uses 8 intelligent applications spanning computer vision, speech, and natural language domains, all employing state-of-the-art Deep Neural Networks (DNNs) as the core machine learning technique. We find that given the characteristics of DNN algorithms, a fine-grained, layer-level computation partitioning strategy based on the data and computation variations of each layer within a DNN has significant latency and energy advantages over the status quo approach. Using this insight, we design Neurosurgeon, a lightweight scheduler to automatically partition DNN computation between mobile devices and datacenters at the granularity of neural network layers. Neurosurgeon does not require per-application profiling. It adapts to various DNN architectures, hardware platforms, wireless networks, and server load levels, intelligently partitioning computation for best latency or best mobile energy. We evaluate Neurosurgeon on a state-of-the-art mobile development platform and show that it improves end-to-end latency by 3.1X on average and up to 40.7X, reduces mobile energy consumption by 59.5% on average and up to 94.7%, and improves datacenter throughput by 1.5X on average and up to 6.7X.

[1]  John Tran,et al.  cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.

[2]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[3]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[4]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[5]  Xuehai Zhou,et al.  PuDianNao: A Polyvalent Machine Learning Accelerator , 2015, ASPLOS.

[6]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[7]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[8]  Nicholas D. Lane,et al.  DeepX: A Software Accelerator for Low-Power Deep Learning Inference on Mobile Devices , 2016, 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN).

[9]  Quan Chen,et al.  DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[10]  Sandra Servia Rodríguez,et al.  Personal Model Training under Privacy Constraints , 2017, ArXiv.

[11]  Ming Yang,et al.  DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Lingjia Tang,et al.  Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[13]  Qian Wang,et al.  AUGEM: Automatically generate high performance Dense Linear Algebra kernels on x86 CPUs , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[14]  Nicholas D. Lane,et al.  DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning , 2015, UbiComp.

[15]  Alec Wolman,et al.  MCDNN: An Execution Framework for Deep Neural Networks on Resource-Constrained Devices , 2015 .

[16]  Kevin Skadron,et al.  Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[17]  Ronald G. Dreslinski,et al.  A hybrid approach to offloading mobile image classification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  Feng Qian,et al.  A close examination of performance and power characteristics of 4G LTE networks , 2012, MobiSys '12.

[19]  Alexander Gruenstein,et al.  Accurate and compact large vocabulary speech recognition on mobile devices , 2013, INTERSPEECH.

[20]  Quan Chen,et al.  Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers , 2016, ASPLOS.

[21]  Christoforos E. Kozyrakis,et al.  Towards energy proportionality for large-scale latency-critical workloads , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[22]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[23]  Lingjia Tang,et al.  SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[24]  Xu Chen,et al.  COMET: Code Offload by Migrating Execution Transparently , 2012, OSDI.

[25]  Lingjia Tang,et al.  Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers , 2013, ISCA.

[26]  Byung-Gon Chun,et al.  CloneCloud: elastic execution between mobile device and cloud , 2011, EuroSys '11.

[27]  Vijay Janapa Reddi,et al.  Mobile CPU's rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[28]  Qi Li,et al.  Thermal time shifting: Leveraging phase change materials to reduce cooling costs in warehouse-scale computers , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[29]  Junjie Wu,et al.  BigHouse: A simulation infrastructure for data center systems , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.

[30]  Daniel Mossé,et al.  Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[31]  Ninghui Sun,et al.  DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.

[32]  Lingjia Tang,et al.  Protean Code: Achieving Near-Free Online Code Transformations for Warehouse Scale Computers , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[33]  Lingjia Tang,et al.  Whare-map: heterogeneity in "homogeneous" warehouse-scale computers , 2013, ISCA.

[34]  Ramesh Govindan,et al.  Odessa: enabling interactive perception applications on mobile devices , 2011, MobiSys '11.

[35]  Ethan Katz-Bassett,et al.  Mobile Network Performance from User Devices: A Longitudinal, Multidimensional Analysis , 2014, PAM.

[36]  Karin Strauss,et al.  Accelerating Deep Convolutional Neural Networks Using Specialized Hardware , 2015 .

[37]  Alec Wolman,et al.  MAUI: making smartphones last longer with code offload , 2010, MobiSys '10.

[38]  Lingjia Tang,et al.  PowerChop: Identifying and Managing Non-critical Units in Hybrid Processor Architectures , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[39]  M. Slee,et al.  Thrift : Scalable Cross-Language Services Implementation , 2022 .

[40]  Lingjia Tang,et al.  Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[41]  Tao Wang,et al.  Deep learning with COTS HPC systems , 2013, ICML.

[42]  Ronald G. Dreslinski,et al.  Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers , 2015, ASPLOS.

[43]  Ronald G. Dreslinski,et al.  Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).