A study of source-level compiler algorithms for automatic construction of pre-execution code

Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This article investigates several source-to-source C compilers for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. We present an aggressive profile-driven compiler that employs three powerful algorithms for code extraction. First, program slicing removes non-critical code for computing cache-missing memory references. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, speculative loop parallelization generates thread-level parallelism to tolerate the latency of blocking loads. In addition, we present four "reduced" compilers that employ less aggressive algorithms to simplify compiler implementation. Our reduced compilers rely on back-end code optimizations rather than program slicing to remove non-critical code, and use compile-time heuristics rather than profiling to approximate runtime information (e.g., cache-miss and loop-trip counts).

We prototype our algorithms on the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [Lyle and Wallace 1997]. Using our prototype, we undertake a performance evaluation of our compilers on a detailed architectural simulator of an 8-way out-of-order SMT processor with 4 hardware contexts, and 13 applications selected from the SPEC and Olden benchmark suites. Our most aggressive compiler improves the performance of 10 out of 13 applications, reducing execution time by 20.9%. Across all 13 applications, our aggressive compiler achieves a harmonic average speedup of 17.6%.

For our reduced compilers, eliminating program slicing and relying on back-end optimizations degrades performance minimally, suggesting that effective pre-execution compilers can be built without program slicing. Furthermore, without cache-miss profiles, we still achieve good speedup, 15.5%, but without loop-trip count profiles, we achieve a speedup of only 7.7%. Finally, our results show compiler-based pre-execution can benefit multiprogrammed workloads. Simultaneously executing applications achieve higher throughput with pre-execution compared to no pre-execution. Due to contention for hardware contexts, however, time-slicing outperforms simultaneous execution in some cases where individual applications make heavy use of pre-execution threads.
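To make the first two extraction steps concrete, the following is a minimal hand-written sketch in C, not output of the compilers described here. It assumes a GCC- or Clang-compatible toolchain for `__builtin_prefetch`; the function names `sum_indirect` and `sum_indirect_slice` and the choice of an indirect array reduction are hypothetical, picked only for illustration. The slice keeps just the address computation for the cache-missing reference (program slicing) and replaces the final blocking load with a non-blocking prefetch (prefetch conversion).

```c
#include <assert.h>
#include <stddef.h>

/* Main computation: an indirect reduction. The load data[idx[i]] is the
 * kind of reference that may miss in the cache (an assumption made for
 * this example, not a measured result). */
long sum_indirect(const long *data, size_t data_len,
                  const size_t *idx, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        assert(idx[i] < data_len);  /* illustrative bounds check */
        sum += data[idx[i]];        /* blocking load on the critical path */
    }
    return sum;
}

/* Hypothetical pre-execution slice of the loop above.
 * Slicing: the accumulation into sum does not feed the address of the
 * missing load, so it is dropped.
 * Prefetch conversion: the load of data[idx[i]] becomes a prefetch, so
 * the helper does not stall on it.
 * The load of idx[i] stays a blocking load because it produces the
 * address. Returns the number of prefetches issued. */
size_t sum_indirect_slice(const long *data, const size_t *idx, size_t n)
{
    size_t issued = 0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&data[idx[i]], 0 /* read */, 1 /* low reuse */);
        issued++;
    }
    return issued;
}
```

In an actual pre-execution setting the slice would run in a spare hardware context ahead of `sum_indirect` rather than being called sequentially. Speculative loop parallelization, which would split the slice's iterations across several helper threads to overlap the remaining blocking loads, is not shown.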

[1] Christopher Hughes, et al. Speculative precomputation: long-range prefetching of delinquent loads, 2001, ISCA 2001.

[2] C. Zilles, et al. Understanding the backward slices of performance degrading instructions, 2000, Proceedings of 27th International Symposium on Computer Architecture.

[3] Chi-Keung Luk, et al. Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors, 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[4] David Binkley, et al. Unravel: a case tool to assist evaluation of high integrity software, 1995.

[5] Gurindar S. Sohi, et al. Speculative data-driven multithreading, 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[6] Todd M. Austin, et al. The SimpleScalar tool set, version 2.0, 1997, CARN.

[7] Sarita V. Adve, et al. Code transformations to improve memory parallelism, 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[8] David A. Padua, et al. High-Speed Multiprocessors and Compilation Techniques, 1980, IEEE Transactions on Computers.

[9] Todd C. Mowry, et al. Tolerating latency in multiprocessors through compiler-inserted prefetching, 1998, TOCS.

[10] John Paul Shen, et al. Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation, 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[11] Jignesh M. Patel, et al. Data prefetching by dependence graph precomputation, 2001, ISCA 2001.

[12] Yale N. Patt, et al. Simultaneous subordinate microthreading (SSMT), 1999, ISCA.

[13] Dean M. Tullsen, et al. Supporting fine-grained synchronization on a simultaneous multithreading processor, 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[14] Andreas Moshovos, et al. Dependence based prefetching for linked data structures, 1998, ASPLOS VIII.

[15] K. Sundaramoorthy, et al. Slipstream processors: improving both performance and fault tolerance, 2000, SIGP.

[16] Anne Rogers, et al. Supporting dynamic data structures on distributed-memory machines, 1995, TOPL.

[17] Yale N. Patt, et al. Difficult-path branch prediction using subordinate microthreads, 2002, ISCA.

[18] Andreas Moshovos, et al. Improving virtual function call target prediction via dependence-based pre-computation, 1999, ICS '99.

[19] Lance M. Berc, et al. Continuous profiling: where have all the cycles gone?, 1997, ACM Trans. Comput. Syst.

[20] David W. Binkley, et al. Program slicing, 2008, 2008 Frontiers of Software Maintenance.

[21] Trevor N. Mudge, et al. Author retrospective improving data cache performance by pre-executing instructions under a cache miss, 1997, International Conference on Supercomputing.

[22] Gurindar S. Sohi, et al. Master/slave speculative parallelization, 2002, MICRO.

[23] Eduardo Sanchez, et al. A Study of a Simultaneous Multithreaded Processor Implementation, 1999, Euro-Par.

[24] Dolores R. Wallace, et al. Using the Unravel Program Slicing Tool to Evaluate High Integrity Software, 1997.

[25] Olivier Temam, et al. Dataflow analysis of branch mispredictions and its application to early resolution of branch outcomes, 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[26] Craig Zilles, et al. Execution-based prediction using speculative slices, 2001, ISCA 2001.

[27] Jack L. Lo, et al. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[28] David Keppel, et al. Shade: a fast instruction-set simulator for execution profiling, 1994, SIGMETRICS.

[29] Dean M. Tullsen, et al. Symbiotic jobscheduling for a simultaneous multithreading processor, 2000, SIGP.

[30] Jean-Loup Baer, et al. Effective Hardware Based Data Prefetching for High-Performance Processors, 1995, IEEE Trans. Computers.

[31] Joe D. Warren, et al. The program dependence graph and its use in optimization, 1987, TOPL.

[32] Gurindar S. Sohi, et al. A quantitative framework for automated pre-execution thread selection, 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings.

[33] Dionisios N. Pnevmatikatos, et al. Slice-processors: an implementation of operation-based prediction, 2001, ICS '01.

[34] John Paul Shen, et al. Dynamic speculative precomputation, 2001, MICRO.

[35] David A. Padua, et al. Advanced compiler optimizations for supercomputers, 1986, CACM.

[36] John Paul Shen, et al. Post-pass binary adaptation for software-based speculative precomputation, 2002, PLDI '02.

[37] Garth A. Gibson, et al. Automatic I/O hint generation through speculative execution, 1999, OSDI '99.

[38] Donald Yeung, et al. Design and evaluation of compiler algorithms for pre-execution, 2002, ASPLOS X.

[39] Lance M. Berc, et al. Continuous profiling: where have all the cycles gone?, 1997, TOCS.

[40] M. Dubois, et al. Assisted Execution, 1998.

[41] Ron Cytron, et al. Doacross: Beyond Vectorization for Multiprocessors, 1986, ICPP.