Predictive data locality optimization for higher-order tensor computations

Automating locality optimization is still an open problem for compiler writers. Compiler-based approaches, guided by analytical cost models have achieved some success in matching high performance libraries on a restricted set of computations such as general matrix multiply (GEMM). On the other hand, library-based approaches may present some open scalability concerns. Recent developments in convolutional neural networks has seen an explosion of models, each with differing combinations of parameters. Manually tuning each of these configurations can take many development months. Further, these operations are called multiple times during machine learning training, which necessitates highly optimized implementations. 2D convolutional operators are unique in that they consist of 7-deep loop nests with different loops carrying reuse for different tensors, making the problem of identifying an optimal loop ordering hard. We devise a machine learning-based compiler which learns a regression model, correlating performance with the loop order. We integrate this model with other traditional compiler analysis for transformations such as loop unrolling and vectorization, relying on the MultiLevel Intermediate Representation (MLIR) compiler framework. We achieve an average speedup of 1.67x and 1.41x against oneDNN for 2D convolution forward and weight update kernels respectively. We are also at 0.88x and 0.96x the performance of oneDNN’s best performing implementation which applies additional data layout transformations.

[1]  Michael F. P. O'Boyle,et al.  Milepost GCC: Machine Learning Enabled Self-tuning Compiler , 2011, International Journal of Parallel Programming.

[2]  Thierry Moreau,et al.  Learning to Optimize Tensor Programs , 2018, NeurIPS.

[3]  Oscar R. Hernandez,et al.  HERCULES: A Pattern Driven Code Transformation System , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[4]  Steven G. Johnson,et al.  The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.

[5]  Gary S. Tyson,et al.  Exhaustive optimization phase order space exploration , 2006, International Symposium on Code Generation and Optimization (CGO'06).

[6]  Chris Cummins,et al.  End-to-End Deep Learning of Optimization Heuristics , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[7]  Robert A. van de Geijn,et al.  Anatomy of High-Performance Many-Threaded Matrix Multiplication , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[8]  Frédo Durand,et al.  Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.

[9]  Gianluca Palermo,et al.  A Bayesian network approach for compiler auto-tuning for embedded processors , 2014, 2014 IEEE 12th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia).

[10]  John Wawrzynek,et al.  AutoPhase: Compiler Phase-Ordering for HLS with Deep Reinforcement Learning , 2019, 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[11]  SWIRL++ : Evaluating Performance Models to Guide Code Transformation in Convolutional Neural Networks , 2019, LCPC.

[12]  Uday Bondhugula,et al.  MLIR: A Compiler Infrastructure for the End of Moore's Law , 2020, ArXiv.

[13]  Ion Stoica,et al.  NeuroVectorizer: end-to-end vectorization with deep reinforcement learning , 2020, CGO.

[14]  Michael F. P. O'Boyle,et al.  Using machine learning to focus iterative optimization , 2006, International Symposium on Code Generation and Optimization (CGO'06).

[15]  Armando Solar-Lezama,et al.  The three pillars of machine programming , 2018, MAPL@PLDI.

[16]  Yun Liang,et al.  FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System , 2020, ASPLOS.

[17]  Frédo Durand,et al.  Learning to optimize halide with tree search and random programs , 2019, ACM Trans. Graph..

[18]  Gianluca Palermo,et al.  A Survey on Compiler Autotuning using Machine Learning , 2018, ACM Comput. Surv..

[19]  L. Almagor,et al.  Finding effective compilation sequences , 2004, LCTES '04.

[20]  Bradley C. Kuszmaul,et al.  The pochoir stencil compiler , 2011, SPAA '11.

[21]  Gang Ren,et al.  Is Search Really Necessary to Generate High-Performance BLAS? , 2005, Proceedings of the IEEE.

[22]  Rui Li,et al.  Analytical characterization and design space exploration for optimization of CNNs , 2021, ASPLOS.

[23]  Rajib Mall The Compiler Design Handbook: Optimizations and Machine Code Generation, Second Edition , 2007 .

[24]  Priti Shankar,et al.  The Compiler Design Handbook: Optimizations and Machine Code Generation , 2002, The Compiler Design Handbook.

[25]  Uri Alon,et al.  code2vec: learning distributed representations of code , 2018, Proc. ACM Program. Lang..