CodeSeer: input-dependent code variants selection via machine learning

In high performance computing (HPC), scientific simulation codes are executed repeatedly with different inputs. The peak performance of these programs depends heavily on compiler optimizations, which are often selected without regard to program input, or at best with sensitivity to a single input. When the program is subsequently run with different inputs, performance may suffer for all inputs, or for all but the one input tuned for, and in the latter case may even fall below the -O3 baseline. This work proposes CodeSeer, a new auto-tuning framework that assesses and improves existing input-agnostic or single-input-centric, rigid application tuning methods. Aided by CodeSeer, we observe that modern HPC programs exhibit different types of input sensitivity, which pose a significant challenge for prior work. To tackle this problem, CodeSeer employs several machine learning models to predict the best code variant for each input on the fly. Our evaluation shows that CodeSeer incurs less than 0.01 seconds of overhead, predicts the best code variant with a geometric-mean precision of 92%, and improves per-input peak performance beyond that of prior approaches.
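As a minimal, hypothetical sketch of the underlying idea (not CodeSeer's actual features, models, or interface), one could measure each pre-compiled code variant on a set of training inputs offline, fit a classifier that maps cheap-to-collect input features to the fastest variant, and query that classifier per input at run time. The feature set, variant labels, and the use of scikit-learn's RandomForestClassifier below are assumptions made purely for illustration.

```python
# Illustrative sketch only -- not CodeSeer's actual model, features, or API.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-input features (e.g., problem size, sparsity, thread count),
# collected offline while timing each code variant on training inputs.
X_train = np.array([
    [1_000,   0.01,  8],
    [1_000,   0.50,  8],
    [100_000, 0.01, 16],
    [100_000, 0.50, 16],
])
# Label = index of the variant observed to be fastest for that input.
y_train = np.array([0, 2, 1, 2])

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

def select_variant(features, variants):
    """Predict the best code variant for an unseen input and return it."""
    idx = int(model.predict(np.asarray(features).reshape(1, -1))[0])
    return variants[idx]

# Usage (hypothetical callables standing in for differently optimized builds):
# chosen = select_variant([50_000, 0.1, 16], [run_o3, run_unrolled, run_vectorized])
# chosen(...)
```

Because the prediction is a single classifier query over a handful of scalar features, the run-time selection cost stays negligible relative to the kernel it dispatches, which is consistent with the sub-0.01-second overhead reported above.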
