Towards a Learning-Based Performance Modeling for Accelerating Deep Neural Networks

Emerging applications such as Deep Learning are often data-driven, so traditional approaches based on auto-tuners are not performance-effective across the wide range of inputs used in practice. In this paper, we begin an investigation of predictive models based on machine-learning techniques for optimizing Convolutional Neural Networks (CNNs). As a use case, we focus on the ARM Compute Library, which provides three different implementations of the convolution operator at different numeric precisions. Starting from a collection of benchmarks, we build and validate models learned with a Decision Tree and a naive Bayes classifier. Preliminary experiments on a Midgard-based ARM Mali GPU show that our predictive model outperforms any convolution operator selected manually through the library.
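
The idea of learning a per-layer selector can be pictured with a minimal scikit-learn sketch. The feature set, the three candidate labels (e.g. GEMM-based, direct, Winograd), and the synthetic training data below are illustrative assumptions, not the paper's actual benchmark pipeline.

```python
# Minimal sketch of a learned convolution-algorithm selector.
# Assumptions: scikit-learn is available; features, labels, and data are synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Each sample describes one convolutional layer; the label is the fastest
# implementation measured on the target GPU (e.g. 0=GEMM, 1=direct, 2=Winograd).
rng = np.random.default_rng(0)
X = rng.integers(1, 512, size=(1000, 6)).astype(float)  # [H, W, C_in, C_out, kernel, stride]
y = rng.integers(0, 3, size=1000)                        # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for name, model in [("decision tree", DecisionTreeClassifier(max_depth=8)),
                    ("naive Bayes", GaussianNB())]:
    model.fit(X_train, y_train)
    print(f"{name}: accuracy = {model.score(X_test, y_test):.3f}")
```

In a real setting the labels would come from timing each convolution implementation on the device, and the trained classifier would replace a hand-written heuristic when dispatching each layer.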
