AsyMo: scalable and efficient deep-learning inference on asymmetric mobile CPUs
Fengyuan Xu | Yunxin Liu | Manni Wang | Shaohua Ding | Ting Cao
[1] Xiaoning Wang,et al. FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures , 2020, IEEE Transactions on Parallel and Distributed Systems.
[2] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[3] Ting Cao,et al. Portable performance on Asymmetric Multicore Processors , 2016, 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[4] Haichen Shen,et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning , 2018, OSDI.
[5] Norman P. Jouppi,et al. Single-ISA heterogeneous multi-core architectures for multithreaded workload performance , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.
[6] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[7] Carl Staelin,et al. lmbench: Portable Tools for Performance Analysis , 1996, USENIX Annual Technical Conference.
[8] Jason Maassen,et al. Optimizing convolution operations on GPUs using adaptive tiling , 2014, Future Gener. Comput. Syst.
[9] Carole-Jean Wu,et al. Machine Learning at Facebook: Understanding Inference at the Edge , 2019, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[10] Hui Liu,et al. On-Demand Deep Model Compression for Mobile Devices: A Usage-Driven Model Selection Framework , 2018, MobiSys.
[11] Vivienne Sze,et al. Designing Energy-Efficient Convolutional Neural Networks Using Energy-Aware Pruning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[12] Forrest N. Iandola,et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size , 2016, ArXiv.
[13] Lukás Burget,et al. Recurrent neural network based language model , 2010, INTERSPEECH.
[14] Minjia Zhang,et al. DeepCPU: Serving RNN-based Deep Learning Models 10x Faster , 2018, USENIX Annual Technical Conference.
[15] Quoc V. Le,et al. Searching for MobileNetV3 , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[16] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[17] Xuanzhe Liu,et al. A First Look at Deep Learning Apps on Smartphones , 2018, WWW.
[18] Yuxiong He,et al. GRNN: Low-Latency and Scalable RNN Inference on GPUs , 2019, EuroSys.
[19] Jidong Zhai,et al. Collaborative Heterogeneity-Aware OS Scheduler for Asymmetric Multicore Processors , 2021, IEEE Transactions on Parallel and Distributed Systems.
[20] Chris Cummins,et al. Autotuning OpenCL Workgroup Size for Stencil Patterns , 2015, ArXiv.
[21] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.
[22] Wei Liu,et al. SSD: Single Shot MultiBox Detector , 2015, ECCV.
[23] Nick Knupffer. Intel Corporation , 2018, The Grants Register 2019.
[24] Paolo Napoletano,et al. Benchmark Analysis of Representative Deep Neural Network Architectures , 2018, IEEE Access.
[25] J. Selvakumar,et al. Appropriate allocation of workloads on performance asymmetric multicore architectures via deep learning algorithms , 2020, Microprocess. Microsystems.
[26] Todd D. Millstein,et al. RERAN: Timing- and touch-sensitive record and replay for Android , 2013, 2013 35th International Conference on Software Engineering (ICSE).
[27] James Demmel,et al. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.
[28] Monica S. Lam,et al. The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.
[29] Jack J. Dongarra,et al. A Note on Auto-tuning GEMM for GPUs , 2009, ICCS.
[30] Yida Wang,et al. Optimizing CNN Model Inference on CPUs , 2018, USENIX Annual Technical Conference.
[31] Yang Hu,et al. Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[32] Anne C. Elster,et al. Machine Learning Based Auto-Tuning for Enhanced OpenCL Performance Portability , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium Workshop.
[33] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34] Nicolas Weber,et al. SOL: Effortless Device Support for AI Frameworks without Source Code Changes , 2020, 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID).
[35] Jack J. Dongarra,et al. Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput.
[36] Joel Emer,et al. Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks , 2016, CARN.
[37] Cong Liu,et al. PredJoule: A Timing-Predictable Energy Optimization Framework for Deep Neural Networks , 2018, 2018 IEEE Real-Time Systems Symposium (RTSS).
[38] Joel Emer,et al. A method to estimate the energy consumption of deep neural networks , 2017, 2017 51st Asilomar Conference on Signals, Systems, and Computers.
[39] Patrice Y. Simard,et al. High Performance Convolutional Neural Networks for Document Processing , 2006.
[40] Cedric Nugteren,et al. CLTune: A Generic Auto-Tuner for OpenCL Kernels , 2015, 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip.