Look-Ahead Compile-Time Scheduling

To enhance the performance of memory-bound applications, hardware designs such as the out-of-order (OoO) execution engine have been developed to hide memory latency, at the price of increased energy consumption. Contemporary processor cores span a wide range of performance and energy-efficiency options, from fast and power-hungry OoO processors to efficient but slower in-order processors. The more memory-bound an application is, the more aggressive the OoO execution engine has to be to hide memory latency. This proposal targets the middle ground: the simple OoO core, which strikes a good balance between performance and energy efficiency and currently dominates the market for mobile, hand-held devices and high-end embedded systems. We show that these simple, more energy-efficient OoO cores, equipped with the appropriate compile-time support, considerably boost the performance of single-threaded execution and reach new levels of performance for memory-bound applications. We present Clairvoyance, a look-ahead compile-time scheduling technique that generates code able to hide memory latency and better utilize the OoO engine, thus delivering higher performance at lower energy. To this end, Clairvoyance overcomes the restrictions that rendered conventional compile-time techniques impractical: (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. As a result, Clairvoyance achieves a geomean execution-time improvement of 7% for memory-bound applications with a conservative approach and 13% with a speculative but safe approach, on top of standard O3 optimizations, while maintaining the high performance of compute-bound applications.
