Intelligent speculation for pipelined multithreading

In recent years, microprocessor manufacturers have shifted their focus from single-core to multi-core processors. Since many of today’s applications are single-threaded and since it is likely that many of tomorrow’s applications will have far fewer threads than there will be processor cores, automatic thread extraction is an essential tool for effectively leveraging today’s multi-core and tomorrow’s many-core processors. A recently proposed technique, Decoupled Software Pipelining (DSWP), has demonstrated promise by partitioning loops into long-running threads organized into a pipeline. Using a pipeline organization and execution decoupled by inter-core communication queues, DSWP offers increased execution efficiency that is largely independent of inter-core communication latency and variability in intra-thread performance. This dissertation extends the pipelined parallelism paradigm with speculation. Using speculation, dependences that manifest infrequently or are easily predictable can be safely ignored by the compiler allowing it to carve more, and better balanced, thread-based pipeline stages from a single thread of execution. Prior speculative threading proposals were obligated to speculate most, if not all, loop-carried dependences to squeeze the code segment under consideration into the mold required by the parallelization paradigm. Unlike those techniques, this dissertation demonstrates that speculation need only break the longest few dependence cycles to enhance the applicability and scalability of the pipelined multi-threading paradigm. By speculatively breaking these cycles, instructions that were formerly restricted to a single thread to ensure decoupling are now free to span multiple threads. To demonstrate the effectiveness of speculative pipelined multi-threading, this dissertation presents the design and experimental evaluation of our fully automatic compiler transformation, Speculative Decoupled Software Pipelining, a speculative extension to DSWP. This dissertation additionally introduces multi-threaded transactional memories to support speculative pipelined multi-threading. Similar to past speculative parallelization approaches, speculative pipelined multi-threading relies on runtime-system support to buffer speculative modifications to memory. However, this dissertation demonstrates that existing proposals to buffer speculative memory state, transactional memories, are insufficient for speculative pipelined multi-threading because the speculative buffers are restricted to a single thread. Further, this dissertation demonstrates that this limitation leads to modularity and composability problems even for transactional programming, thus limiting the potential of that approach also. To surmount these limitations, this thesis introduces multi-threaded transactional memories and presents an initial hardware implementation.

[1]  Matthias F. Stallmann,et al.  Optimization algorithms for the minimum-cost satisfiability problem , 2004 .

[2]  Josep Torrellas,et al.  Eliminating squashes through learning cross-thread violations in speculative parallelization for multiprocessors , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[3]  Josep Torrellas,et al.  A Chip-Multiprocessor Architecture with Speculative Multithreading , 1999, IEEE Trans. Computers.

[4]  Radu Iosif,et al.  A deadlock detection tool for concurrent Java programs , 1999, Softw. Pract. Exp..

[5]  Mikko H. Lipasti,et al.  Value locality and load value prediction , 1996, ASPLOS VII.

[6]  Wei Liu,et al.  POSH: a TLS compiler that exploits program structure , 2006, PPoPP '06.

[7]  Rudolf Eigenmann,et al.  Min-cut program decomposition for thread-level speculation , 2004, PLDI '04.

[8]  Haitham Akkary,et al.  A dynamic multithreading processor , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[9]  Ron Cytron,et al.  Doacross: Beyond Vectorization for Multiprocessors , 1986, ICPP.

[10]  Kunle Olukotun,et al.  Transactional coherence and consistency: simplifying parallel hardware and software , 2004, IEEE Micro.

[11]  Maurice Herlihy,et al.  A flexible framework for implementing software transactional memory , 2006, OOPSLA '06.

[12]  Antonia Zhai,et al.  Compiler optimization of value communication for thread-level speculation , 2005 .

[13]  James C. Corbett,et al.  Evaluating Deadlock Detection Methods for Concurrent Software , 1996, IEEE Trans. Software Eng..

[14]  David I. August,et al.  Shape analysis with inductive recursion synthesis , 2007, PLDI '07.

[15]  Jeremy T. Fineman,et al.  Nested parallelism in transactional memory , 2008, PPoPP.

[16]  Joe D. Warren,et al.  The program dependence graph and its use in optimization , 1987, TOPL.

[17]  James Coyle,et al.  Deadlock detection in MPI programs , 2002, Concurr. Comput. Pract. Exp..

[18]  David A. Padua,et al.  Automatic detection of nondeterminacy in parallel programs , 1988, PADD '88.

[19]  Josep Torrellas,et al.  Bulk Disambiguation of Speculative Threads in Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[20]  Easwaran Raman,et al.  A framework for unrestricted whole-program optimization , 2006, PLDI '06.

[21]  Utpal Banerjee Loop Parallelization , 1994, Springer US.

[22]  Bratin Saha,et al.  McRT-STM: a high performance software transactional memory system for a multi-core runtime , 2006, PPoPP '06.

[23]  Wen-mei W. Hwu,et al.  Field-testing IMPACT EPIC research results in Itanium 2 , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[24]  Mikko H. Lipasti,et al.  Exceeding the dataflow limit via value prediction , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[25]  Yun Zhang,et al.  Revisiting the Sequential Programming Model for the Multicore Era , 2008, IEEE Micro.

[26]  Easwaran Raman,et al.  Speculative Decoupled Software Pipelining , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[27]  Hong-Seok Kim,et al.  Bottom-Up and Top-Down Context-Sensitive Summary-Based Pointer Analysis , 2004, SAS.

[28]  David A. Wood,et al.  LogTM: log-based transactional memory , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[29]  S. Malik,et al.  Solving the Minimum-Cost Satisfiability Problem Using SAT Based Branch-and-Bound Search , 2006, 2006 IEEE/ACM International Conference on Computer Aided Design.

[30]  Yun Zhang,et al.  Revisiting the Sequential Programming Model for Multi-Core , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[31]  Josep Torrellas,et al.  Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors , 2005, TACO.

[32]  Kunle Olukotun,et al.  The Stanford Hydra CMP , 2000, IEEE Micro.

[33]  Bradley C. Kuszmaul,et al.  Unbounded Transactional Memory , 2005, HPCA.

[34]  Kunle Olukotun,et al.  Data speculation support for a chip multiprocessor , 1998, ASPLOS VIII.

[35]  Antonio González,et al.  Value prediction for speculative multithreaded architectures , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[36]  Thomas F. Wenisch,et al.  SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, ISCA '03.

[37]  Gurindar S. Sohi,et al.  Master/Slave Speculative Parallelization , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[38]  Antonia Zhai,et al.  Improving value communication for thread-level speculation , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[39]  Antonia Zhai,et al.  The STAMPede approach to thread-level speculation , 2005, TOCS.

[40]  Brian T. Lewis,et al.  Compiler and runtime support for efficient software transactional memory , 2006, PLDI '06.

[41]  Bowen Alpern,et al.  Detecting equality of variables in programs , 1988, POPL '88.

[42]  Donald Yeung,et al.  A study of source-level compiler algorithms for automatic construction of pre-execution code , 2004, TOCS.

[43]  Paul Feautrier,et al.  Direct parallelization of call statements , 1986, SIGPLAN '86.

[44]  Josep Torrellas,et al.  Architectural support for scalable speculative parallelization in shared-memory multiprocessors , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[45]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[46]  David I. August,et al.  Decoupled software pipelining with the synchronization array , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[47]  Mark Moir,et al.  Toward high performance nonblocking software transactional memory , 2008, PPOPP.

[48]  David I. August,et al.  Pipelined multithreading transformations and support mechanisms , 2007 .

[49]  Jenq Kuen Lee,et al.  Interprocedural probabilistic pointer analysis , 2004, IEEE Transactions on Parallel and Distributed Systems.

[50]  Jian Huang,et al.  The Superthreaded Processor Architecture , 1999, IEEE Trans. Computers.

[51]  Easwaran Raman,et al.  Parallel-stage decoupled software pipelining , 2008, CGO '08.

[52]  Gurindar S. Sohi,et al.  Speculative Versioning Cache , 2001, IEEE Trans. Parallel Distributed Syst..

[53]  Scott A. Mahlke,et al.  Uncovering hidden loop level parallelism in sequential applications , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[54]  Thomas F. Wenisch,et al.  TurboSMARTS: accurate microarchitecture simulation sampling in minutes , 2005, SIGMETRICS '05.

[55]  Manoj Franklin,et al.  A general compiler framework for speculative multithreading , 2002, SPAA '02.

[56]  Guilherme Ottoni,et al.  Support for High-Frequency Streaming in CMPs , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[57]  Monica S. Lam,et al.  In search of speculative thread-level parallelism , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).

[58]  David B. Loveman,et al.  Program Improvement by Source-to-Source Transformation , 1977, J. ACM.

[59]  Guilherme Ottoni,et al.  Automatic thread extraction with decoupled software pipelining , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[60]  Guilherme Ottoni,et al.  Communication optimizations for global multi-threaded instruction scheduling , 2008, ASPLOS.

[61]  Michael Burrows,et al.  Eraser: a dynamic data race detector for multithreaded programs , 1997, TOCS.

[62]  David I. August,et al.  Systematic compilation for predicated execution , 2000 .

[63]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[64]  Etienne Morel,et al.  Global optimization by suppression of partial redundancies , 1979, CACM.

[65]  Laurie J. Hendren,et al.  Is it a tree, a DAG, or a cyclic graph? A shape analysis for heap-directed pointers in C , 1996, POPL '96.

[66]  Randal E. Bryant,et al.  Graph-Based Algorithms for Boolean Function Manipulation , 1986, IEEE Transactions on Computers.

[67]  J. Gregory Steffan,et al.  A probabilistic pointer analysis for speculative optimizations , 2006, ASPLOS XII.

[68]  Wen-mei W. Hwu,et al.  Modular interprocedural pointer analysis using access paths: design, implementation, and evaluation , 2000, PLDI '00.

[69]  Mikko H. Lipasti,et al.  Silent stores for free , 2000, MICRO 33.

[70]  Antonio González,et al.  Clustered speculative multithreaded processors , 1999, ICS '99.

[71]  Brad Calder,et al.  Using SimPoint for accurate and efficient simulation , 2003, SIGMETRICS '03.