AsyMo: scalable and efficient deep-learning inference on asymmetric mobile CPUs

On-device deep learning (DL) inference has attracted vast interest. Mobile CPUs are the most common hardware for on-device inference, and many inference frameworks have been developed for them. Yet, due to hardware complexity, DL inference on mobile CPUs suffers from two common issues: poor performance scalability on asymmetric multiprocessors, and energy inefficiency. We identify the root causes as improper task partitioning and unbalanced task distribution for the poor scalability, and unawareness of model behaviour for the energy inefficiency. Based on these findings, we propose a novel technique called AsyMo for the thread-pool implementation of DL frameworks to solve the two issues. The key design principle is to leverage the execution determinism of DL inference and build an optimal execution plan offline by jointly considering model structures and hardware characteristics. For performance scalability, AsyMo implements cost-model-directed partitioning and asymmetry-aware task scheduling to properly divide and fairly schedule tasks on asymmetric CPUs. For energy saving, AsyMo determines the least-energy-cost frequency based on the data-reuse rate of a model. AsyMo is evaluated on different models and DL frameworks, and all show substantial improvements. For example, compared to an optimized TensorFlow on off-the-shelf mobile CPUs, AsyMo achieves up to 46% performance and 37% energy-efficiency improvement for convolution-dominant models, and up to 97% performance and 1.22× energy-efficiency improvement for fully-connected-dominant models.
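To make the partitioning and scheduling idea concrete, the following Python sketch illustrates, under assumed numbers, how a matrix-multiplication workload could be split across big and LITTLE clusters in proportion to their measured throughput, and how a simple cache-driven cost model could pick a block size. The function names, core counts, throughputs, and cache sizes are illustrative assumptions, not AsyMo's actual cost model or API.

    # A minimal, hypothetical sketch (not AsyMo's actual implementation).
    # All names and numbers below are assumptions for illustration only.

    def partition_matmul_rows(m_rows, big_flops, little_flops,
                              n_big_cores, n_little_cores):
        """Split the M dimension of a GEMM across asymmetric clusters in
        proportion to each cluster's measured throughput, so that big and
        LITTLE cores finish their shares at roughly the same time."""
        big_capacity = big_flops * n_big_cores
        little_capacity = little_flops * n_little_cores
        big_rows = round(m_rows * big_capacity / (big_capacity + little_capacity))
        return big_rows, m_rows - big_rows

    def choose_row_block(cache_bytes, elem_bytes, k_dim):
        """Toy cache-driven cost model: grow the A row block as long as the
        enlarged block (block_rows + 1 rows of A) plus one K-length row of B
        still fits in the per-core cache."""
        block_rows = 1
        while (block_rows + 2) * k_dim * elem_bytes <= cache_bytes:
            block_rows += 1
        return block_rows

    if __name__ == "__main__":
        # Example: a 1024x1024x1024 GEMM on 4 big + 4 LITTLE cores
        # (throughputs and the 64 KB per-core cache are assumed values).
        big_share, little_share = partition_matmul_rows(
            m_rows=1024, big_flops=8.0, little_flops=3.0,
            n_big_cores=4, n_little_cores=4)
        block = choose_row_block(cache_bytes=64 * 1024, elem_bytes=4, k_dim=1024)
        print("rows on big cluster:", big_share,
              "| rows on LITTLE cluster:", little_share,
              "| row block size:", block)

Proportional splitting keeps both clusters busy for roughly the same duration, which is the fairness property an asymmetry-aware scheduler aims for; the block-size choice stands in for the offline cost model that trades parallelism against cache locality.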
