Pre-execution via speculative data-driven multithreading

This dissertation introduces pre-execution, a novel technique for accelerating sequential programs. Pre-execution directly attacks the instructions that cause performance problems: mispredicted branches and cache-missing loads. In pre-execution, future branch outcomes and load addresses are computed on the side and the results are fed to the main program. In doing so, the main program is spared from having to incur the full computation latencies of these instructions. Pre-execution exploits out-of-order fetch and decoupling. Fetching and executing only critical load and branch computations while skipping over all unrelated instructions allows pre-execution to compute values faster than the main program. Decoupling, that is, performing these computations in a separate thread, isolates stalls that occur in them so that they do not directly impact the main program thread. This dissertation describes speculative data-driven multithreading (DDMT), an implementation of pre-execution. DDMT implements the runtime component of pre-execution (responsible for pre-executing computations and communicating the results to the main program) as an extension to a superscalar processor. In addition to using the single cache hierarchy to allow pre-executing computations to prefetch for the main program, DDMT stores individual pre-executed instruction results in the shared physical register file and then passes them one-by-one to the main program via a novel modification to register renaming called register integration. For DDMT's setup component (responsible for finding load and branch computations and conveying them to the runtime component), this dissertation introduces an algorithm for automatically extracting performance-enhancing computations from program traces. The algorithm evaluates a benefit-cost function over all candidate computations in a trace and chooses those that maximize benefit (latency tolerance) while minimizing cost (execution overhead).
The algorithm is formulated to permit software, hardware, and hybrid implementations. The dissertation includes a simulation-driven performance evaluation of DDMT. Our results show that DDMT achieves 10% to 15% performance improvements for general-purpose integer programs running on an aggressive baseline processor with large caches, with the potential for greater improvements on likely future processor designs. We conclude that pre-execution and DDMT are promising technologies that merit consideration for inclusion in future machines.
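
The selection step described above can be illustrated with a minimal sketch. It is not the dissertation's actual algorithm: the abstract does not give the form of the benefit-cost function, so the sketch assumes the simplest one, benefit minus cost, and keeps every candidate with a positive net value. The names `Candidate`, `net_advantage`, and `select_computations`, as well as the cycle counts, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate computation found in a trace (hypothetical model)."""
    name: str
    latency_tolerated: int   # benefit: cycles of latency hidden from the main thread
    execution_overhead: int  # cost: cycles of extra work spent pre-executing


def net_advantage(c: Candidate) -> int:
    # Assumed benefit-cost function: plain difference of benefit and cost.
    return c.latency_tolerated - c.execution_overhead


def select_computations(candidates: list[Candidate]) -> list[Candidate]:
    # Keep only candidates whose benefit exceeds their cost,
    # ordered from most to least profitable.
    profitable = [c for c in candidates if net_advantage(c) > 0]
    return sorted(profitable, key=net_advantage, reverse=True)


# Made-up numbers, for illustration only.
example = [
    Candidate("load_A", latency_tolerated=120, execution_overhead=30),
    Candidate("branch_B", latency_tolerated=40, execution_overhead=55),
    Candidate("load_C", latency_tolerated=80, execution_overhead=20),
]
selected = select_computations(example)
print([c.name for c in selected])
```

In this example `branch_B` is rejected because its overhead exceeds the latency it would hide. A real formulation would presumably also have to account for overlap between candidates and for how often each computation executes, which this sketch ignores.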
