MCompiler: A Synergistic Compilation Framework

This paper presents a meta-compilation framework, the MCompiler. The key idea is that different segments of a program can be compiled with different compilers/optimizers and combined into a single executable. The MCompiler can be used in a number of ways. It can generate an executable with higher performance than any individual compiler, because each compiler applies a specific, ordered set of optimization techniques and its own profitability models and can, therefore, generate code significantly different from that of other compilers. Alternatively, the MCompiler can be used by researchers and compiler developers to evaluate their compiler implementations and compare the results to those of other available compilers/optimizers. A code segment in this work is a loop nest, but other choices are possible. This work also investigates the use of Machine Learning to learn inherent characteristics of loop nests and then predict, during compilation, the most suited code optimizer for each loop nest in an application. This reduces both the need to profile applications and the compilation time. The results show that our framework improves overall application performance over state-of-the-art compilers by a geometric mean of 1.96x for auto-vectorized code and 2.62x for auto-parallelized code. Parallel applications with OpenMP directives are also improved by the MCompiler, with a geometric mean performance improvement of 1.04x (up to 1.74x). The Machine Learning prediction achieves performance very close to the profiling-based search for choosing the most suited code optimizer: within 4% for auto-vectorized code and within 8% for auto-parallelized code. Finally, the MCompiler can be extended to collect metrics other than performance for use in the optimization process; the example presented is the collection of energy data.
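To make the profiling-based workflow concrete, the sketch below shows one way such a meta-compilation driver could work: each extracted loop nest is compiled with every candidate compiler, the candidates are timed, and the fastest object file per nest is linked into the final executable. The compiler set, the file names (nest0.c, harness.c, main.o), and all helper functions are illustrative assumptions, not the MCompiler's actual implementation.

```python
import subprocess
import time

# Hypothetical candidate compilers/optimizers; the actual set used by the
# MCompiler is an assumption here.
COMPILERS = {
    "gcc":   ["gcc",   "-O3", "-c"],
    "clang": ["clang", "-O3", "-c"],
    "icc":   ["icc",   "-O3", "-c"],
}

def compile_loop_nest(src, name, flags):
    """Compile one extracted loop nest into an object file."""
    obj = f"{src}.{name}.o"
    subprocess.run(flags + [src, "-o", obj], check=True)
    return obj

def profile(obj, harness):
    """Link the candidate object into a timing harness and measure it.
    A real framework would run each candidate several times and could use
    hardware counters; a single wall-clock run keeps the sketch short."""
    exe = f"{obj}.bench"
    subprocess.run(["gcc", harness, obj, "-o", exe], check=True)
    start = time.perf_counter()
    subprocess.run([f"./{exe}"], check=True)
    return time.perf_counter() - start

def best_object(src, harness):
    """Profiling-based search: pick the fastest compiler for this loop nest."""
    timings = {}
    for name, flags in COMPILERS.items():
        obj = compile_loop_nest(src, name, flags)
        timings[obj] = profile(obj, harness)
    return min(timings, key=timings.get)

# One object file per loop nest, each possibly produced by a different
# compiler, linked into a single executable.
chosen = [best_object(nest, "harness.c") for nest in ["nest0.c", "nest1.c"]]
subprocess.run(["gcc", "main.o"] + chosen + ["-o", "app"], check=True)
```

The Machine Learning alternative described above can likewise be sketched as a classifier that maps loop-nest characteristics to the code optimizer that won the profiling-based search, so unseen loop nests can be assigned an optimizer at compile time without profiling. The feature set and the random-forest model below are assumptions for illustration, not the paper's exact design.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-loop-nest features, e.g.
# [depth, trip_count_estimate, array_refs, has_reduction, stride_1_accesses]
X_train = [
    [2, 1024, 3, 0, 1],
    [3,  256, 6, 1, 0],
    [1, 4096, 1, 0, 1],
]
# Labels: index of the code optimizer that won the profiling-based search.
y_train = [0, 2, 1]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# At compile time, predict the optimizer for an unseen loop nest instead
# of compiling and profiling it with every candidate.
print(model.predict([[2, 512, 4, 0, 1]]))
```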
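In practice, the training labels would come from the same profiling-based search shown in the first sketch, run offline over a loop repository, so the predictor amortizes the search cost across future compilations.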
