Customized Monte Carlo Tree Search for LLVM/Polly's Composable Loop Optimization Transformations

Polly is the LLVM project's polyhedral loop optimizer. User-directed loop transformation pragmas have recently been proposed for LLVM/Clang and Polly. The search space exposed by these pragmas is a tree in which each node represents a specific combination of loop transformations that can be applied to the code resulting from the parent node's loop transformations. To find the best combination of these loop transformations, we have developed a search algorithm based on Monte Carlo tree search (MCTS). The algorithm consists of two phases: exploring loop transformations at different depths of the tree to identify promising regions of the search space, and exploiting those regions by performing a local search. In addition, a restart mechanism keeps the MCTS from getting trapped in a local optimum, and the best and worst solutions from the previous restart phases are carried over to leverage the search history. We compare our approach with breadth-first, beam, global greedy, and random search methods using PolyBench benchmarks and ECP proxy applications. Experimental results show that our MCTS algorithm finds pragma combinations that achieve an average speedup of 2.3x over Polly's heuristic optimizations.
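
To make the tree-shaped search space concrete, the following is a minimal sketch of a plain UCB1-based Monte Carlo tree search over transformation sequences. It is not the customized algorithm described above: the two-phase explore/exploit scheme and the restart mechanism are omitted. The transformation names, the depth limit, and the `toy_runtime` cost model are all hypothetical stand-ins for compiling and timing real pragma configurations.

```python
import math
import random

# Hypothetical transformation names; a real search space would come from the pragmas.
TRANSFORMS = ("tile", "interchange", "unroll")
MAX_DEPTH = 3       # assumed limit on how many transformations are stacked
BASELINE = 10.0     # assumed runtime of the untransformed loop nest


def toy_runtime(config):
    """Synthetic stand-in for compiling and timing a configuration (assumption)."""
    runtime = BASELINE + 0.5 * len(config)   # every transformation adds overhead
    if config[:1] == ("tile",):
        runtime -= 3.0                        # tiling first helps in this toy model
    if config[:2] == ("tile", "interchange"):
        runtime -= 4.0                        # interchange after tiling helps more
    return runtime


class Node:
    """One node = one sequence of transformations applied on top of its parent."""

    def __init__(self, config, parent=None):
        self.config = config
        self.parent = parent
        self.children = []
        self.untried = list(TRANSFORMS) if len(config) < MAX_DEPTH else []
        self.visits = 0
        self.total = 0.0


def ucb1(child, parent_visits, c=math.sqrt(2)):
    """Mean reward plus the standard UCB1 exploration bonus."""
    mean = child.total / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def mcts(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(())
    best_config, best_speedup = (), 1.0
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded nodes by UCB1.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
        # Expansion: add one untried transformation below the selected node.
        if node.untried:
            t = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.config + (t,), parent=node)
            node.children.append(child)
            node = child
        # Rollout: extend randomly to MAX_DEPTH, scoring every configuration seen.
        config, reward = node.config, 0.0
        while True:
            speedup = BASELINE / toy_runtime(config)
            reward = max(reward, speedup)
            if speedup > best_speedup:
                best_config, best_speedup = config, speedup
            if len(config) >= MAX_DEPTH:
                break
            config = config + (rng.choice(TRANSFORMS),)
        # Backpropagation: update statistics along the path back to the root.
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    return best_config, best_speedup
```

Calling `mcts()` returns the best configuration seen and its speedup over the toy baseline. In this synthetic model the optimum is, by construction, `("tile", "interchange")` with a speedup of 2.5. Note that every node, not only the leaves, is a valid configuration, which mirrors the tree described in the abstract.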
