A methodology correlating code optimizations with data memory accesses, execution time and energy consumption

Abstract

The proliferation of data and electronic devices has put software with low execution time and energy consumption in the spotlight. The key to optimizing software is the correct choice, ordering and parametrization of optimization transformations, which has remained an open problem in compilation research for decades, for several reasons. First, most transformations are interdependent, so addressing them separately is not effective. Second, it is very hard to couple the transformation parameters to the processor architecture (e.g., cache size) and the algorithm characteristics (e.g., data reuse); therefore, compiler designers and researchers either ignore these parameters or take them into account only partly. Third, the exploration space, i.e., the set of all optimization configurations that have to be explored, is huge, and thus searching it is impractical. In this paper, the above problems are addressed for data-dominant affine loop kernels, delivering significant contributions. A novel methodology is presented that reduces the exploration space of six code optimizations by many orders of magnitude. The objective can be execution time (ET), energy consumption (E), or the number of L1, L2 and main memory accesses. The exploration space is reduced in two phases: first, by applying a novel register blocking algorithm and a novel loop tiling algorithm, and second, by computing the maximum and minimum ET/E values for each optimization set. The proposed methodology has been evaluated on both embedded and general-purpose CPUs and on seven well-known algorithms, achieving high memory access, speedup and energy consumption gains (from 1.17 up to 40) over the gcc compiler, hand-written optimized code, and Polly. The exploration space from which the near-optimum parameters are selected is reduced by 17 up to 30 orders of magnitude.
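The two pruning transformations named in the abstract can be illustrated with a minimal sketch (this is not the paper's actual algorithm, and the tile size and unroll factor here are fixed by hand rather than derived from the cache size and data reuse as the methodology proposes): loop tiling partitions the iteration space into blocks sized to fit the cache, and register blocking unrolls the innermost loops so that reused values are held in registers instead of being reloaded from memory.

```c
#include <stddef.h>

/* Illustrative only: matrix multiply with loop tiling (tile size T,
 * ideally chosen so that the working set of three T x T tiles fits in
 * the L1 data cache) and a 2x2 register-blocked micro-kernel.
 * Assumes N is a multiple of T and T is a multiple of 2. */
#define N 64
#define T 16

void mm_tiled_regblocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (size_t ii = 0; ii < N; ii += T)
        for (size_t jj = 0; jj < N; jj += T)
            for (size_t kk = 0; kk < N; kk += T)
                /* 2x2 register block: four partial sums stay in
                 * registers across the whole k loop of this tile */
                for (size_t i = ii; i < ii + T; i += 2)
                    for (size_t j = jj; j < jj + T; j += 2) {
                        double c00 = C[i][j],   c01 = C[i][j + 1];
                        double c10 = C[i + 1][j], c11 = C[i + 1][j + 1];
                        for (size_t k = kk; k < kk + T; k++) {
                            c00 += A[i][k]     * B[k][j];
                            c01 += A[i][k]     * B[k][j + 1];
                            c10 += A[i + 1][k] * B[k][j];
                            c11 += A[i + 1][k] * B[k][j + 1];
                        }
                        C[i][j] = c00;       C[i][j + 1] = c01;
                        C[i + 1][j] = c10;   C[i + 1][j + 1] = c11;
                    }
}
```

Even this toy version shows why the transformations are interdependent: the register-block factor changes how many array elements each tile iteration touches, which in turn changes the tile size that fits in cache, which is exactly the kind of coupling the paper's analytical approach exploits to prune the configuration space.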
