Fine-Grained Energy and Performance Profiling Framework for Deep Convolutional Neural Networks

There is a huge demand for on-device execution of deep learning algorithms on mobile and embedded platforms. These devices constrain applications through limited resources and power budgets. Hence, developing energy-efficient solutions will require innovation in algorithmic design, software and hardware. Such innovation requires benchmarking and characterization of Deep Neural Networks based on performance and energy consumption alongside accuracy. However, current benchmarking studies of existing deep learning frameworks (for example, Caffe, TensorFlow and Torch) focus on the performance of these applications on high-end CPUs and GPUs. In this work, we introduce a benchmarking framework called "SyNERGY" to measure the energy and execution time of 11 representative Deep Convolutional Neural Networks on embedded platforms such as the NVIDIA Jetson TX1. We integrate ARM's Streamline Performance Analyser with standard deep learning frameworks such as Caffe and cuDNN v5 to study the execution behaviour of current deep learning models at a fine-grained (per-layer) level on image processing tasks. In addition, we build an initial multi-variable linear regression model that predicts the energy consumption of unseen neural network models from the number of SIMD instructions executed and main memory accesses of the TX1's CPU cores, with an average relative test error of 8.04 ± 5.96%. Surprisingly, we find that the model can be refined to predict the number of SIMD instructions and main memory accesses solely from the application's Multiply-Accumulate (MAC) counts, thereby eliminating the need for actual on-device measurements. The refined model achieves an average relative error of 7.08 ± 6.0% against actual energy measurements across all networks tested except MobileNet; including MobileNet raises the average relative test error to 17.33 ± 12.2%.
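To make the two-stage prediction scheme concrete, the sketch below illustrates it with scikit-learn. This is a minimal illustration only: the feature layout (SIMD instruction counts and main memory accesses as regressors, MAC counts as the upstream predictor) follows the abstract, but all numeric values, array names and the choice of scikit-learn are placeholder assumptions, not the paper's actual data or implementation.

```python
# Illustrative sketch of the two-stage energy prediction described above.
# All numbers below are made-up placeholders, not measured values.
import numpy as np
from sklearn.linear_model import LinearRegression

# Stage 1: hardware counters -> energy (multi-variable linear model).
# Columns of X_hw: [SIMD instructions executed, main memory accesses].
X_hw = np.array([[2.1e9, 3.5e7],
                 [1.4e9, 2.2e7],
                 [0.6e9, 1.1e7]])           # placeholder counter readings
y_energy = np.array([0.82, 0.55, 0.24])     # placeholder energy (joules)
energy_model = LinearRegression().fit(X_hw, y_energy)

# Stage 2: refinement -- predict the counters themselves from MAC counts,
# so no on-device measurement is needed for an unseen network.
macs = np.array([[1.0e9], [0.7e9], [0.3e9]])  # placeholder MAC counts
counter_model = LinearRegression().fit(macs, X_hw)  # multi-output regression

# Predict the energy of an unseen network directly from its MAC count.
new_macs = np.array([[0.5e9]])
predicted_counters = counter_model.predict(new_macs)
predicted_energy = energy_model.predict(predicted_counters)
print(predicted_energy)
```

In this arrangement, the counter model and the energy model compose into a single pipeline from a statically computable quantity (MACs) to an energy estimate, which is what allows the measurement step to be dropped for unseen networks.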
