Pipelined multithreading transformations and support mechanisms

Even though chip multiprocessors have emerged as the predominant organization for future microprocessors, the multiple on-chip cores do not directly result in improved application performance (especially for legacy applications, which are predominantly sequential C/C++ codes). Consequently, parallelizing applications to execute on multiple cores is essential to their success. Independent multithreading techniques, like DOALL extraction, create partially or fully independent threads, which communicate rarely, if at all. While such strategies keep high inter-thread communication costs from impacting program performance, they cannot be applied to parallelize general-purpose applications which are characterized by difficult-to-break recurrences. Even though cyclic multithreading techniques, such as DOACROSS, are more applicable, the cyclic inter-thread dependences created by these techniques cause them to have very low tolerance to rising inter-core latencies. To address these problems, this work introduces a pipelined multithreading (PMT) transformation called Decoupled Software Pipelining (DSWP). DSWP, in particular, and PMT techniques, in general, are able to tolerate inter-core latencies, while still handling codes with complex recurrences. They achieve this by enforcing an acyclic communication discipline amongst threads, which allow threads to use inter-thread queues to communicate in a pipelined fashion. This dissertation demonstrates that DSWPed codes not only tolerate inter-core communication costs, but also effectively tolerate variable latency stalls in applications better than single-threaded execution on both in-order and out-of-order issue processors with comparable resources. It then performs a thorough analysis of the performance scalability of automatically generated DSWPed codes and identifies the conditions necessary to achieve peak PMT performance. Next, the dissertation shows that even though PMT applications tolerate inter-core latencies well, the high frequency of inter-thread communication (once every 5 to 20 dynamic instructions) in these codes, makes them very sensitive to the intra-thread overhead imposed by communication operations. In order to understand the issues surrounding inter-thread communication for PMT applications, this dissertation undertakes a methodical exploration of the design space of communication support options for PMT. Three new communication mechanisms with varying cost-performance tradeoffs are introduced and are shown to perform 38% to 200% better than the state of the art.

[1]  Guilherme Ottoni,et al.  Support for High-Frequency Streaming in CMPs , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[2]  Scott Mahlke,et al.  Exploiting Instruction Level Parallelism in the Presence of Conditional Branches , 1997 .

[3]  B. Ramakrishna Rau,et al.  Iterative modulo scheduling: an algorithm for software pipelining loops , 1994, MICRO 27.

[4]  Babak Falsafi,et al.  Coherent Network Interfaces for Fine-Grain Communication , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[5]  Henry Hoffmann,et al.  A stream compiler for communication-exposed architectures , 2002, ASPLOS X.

[6]  Gurindar S. Sohi,et al.  Instruction issue logic for high-performance, interruptable pipelined processors , 1987, ISCA '87.

[7]  Ron Cytron,et al.  Doacross: Beyond Vectorization for Multiprocessors , 1986, ICPP.

[8]  Mateo Valero,et al.  A decoupled KILO-instruction processor , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[9]  Mateo Valero,et al.  Toward kilo-instruction processors , 2004, TACO.

[10]  John Paul Shen,et al.  Speculative precomputation: long-range prefetching of delinquent loads , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[11]  Ken Kennedy,et al.  Conversion of control dependence to data dependence , 1983, POPL '83.

[12]  Michael J. Flynn,et al.  Communication mechanisms in shared memory multiprocessors , 1998 .

[13]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[14]  Anant Agarwal,et al.  Integrating message-passing and shared-memory: early experience , 1993, PPOPP '93.

[15]  Todd C. Mowry,et al.  Compiler-based prefetching for recursive data structures , 1996, ASPLOS VII.

[16]  Kunle Olukotun,et al.  A Single-Chip Multiprocessor , 1997, Computer.

[17]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[18]  Sanjay J. Patel,et al.  Beating in-order stalls with "flea-flicker" two-pass pipelining , 2006, IEEE transactions on computers.

[19]  T. Gross,et al.  !Warp-anatomy of a parallel computing system , 1999, IEEE Concurrency.

[20]  Henry Hoffmann,et al.  The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs , 2002, IEEE Micro.

[21]  Yale N. Patt,et al.  Simultaneous subordinate microthreading , 2004 .

[22]  Dean M. Tullsen,et al.  Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[23]  Pradip Bose,et al.  Optimizing pipelines for power and performance , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[24]  David I. August,et al.  Microarchitectural exploration with Liberty , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[25]  Long Li,et al.  Automatically partitioning packet processing applications for pipelined architectures , 2005, PLDI '05.

[26]  David I. August,et al.  The liberty structural specification language: a high-level modeling language for component reuse , 2004, PLDI '04.

[27]  Jaehyuk Huh,et al.  Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture , 2003, IEEE Micro.

[28]  David I. August,et al.  Amortizing Software Queue Overhead for Pipelined Inter-Thread Communication , 2006 .

[29]  Lizy Kurian John,et al.  Efficiently Evaluating Speedup Using Sampled Processor Simulation , 2004, IEEE Computer Architecture Letters.

[30]  Andreas Moshovos,et al.  Dependence based prefetching for linked data structures , 1998, ASPLOS VIII.

[31]  Haitham Akkary,et al.  Checkpoint processing and recovery: towards scalable large instruction window processors , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[32]  Monica S. Lam,et al.  RETROSPECTIVE : Software Pipelining : An Effective Scheduling Technique for VLIW Machines , 1998 .

[33]  David J. Sager,et al.  The microarchitecture of the Pentium 4 processor , 2001 .

[34]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[35]  Jignesh M. Patel,et al.  Data prefetching by dependence graph precomputation , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[36]  Anant Agarwal,et al.  Scalar operand networks , 2005, IEEE Transactions on Parallel and Distributed Systems.

[37]  Anand Sivasubramaniam,et al.  Architectural Mechanisms for Explicit Communication in Shared Memory Multiprocessors , 1995, SC.

[38]  G. H. Barnes,et al.  A controllable MIMD architecture , 1986 .

[39]  John Wawrzynek,et al.  A Streaming Multi-Threaded Model , 2001 .

[40]  John Flynn,et al.  Adapting the SPEC 2000 benchmark suite for simulation-based computer architecture research , 2001 .

[41]  Gurindar S. Sohi,et al.  Speculative data-driven multithreading , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[42]  David I. August,et al.  Rapid Development of a Flexible Validated Processor Model , 2004 .

[43]  Michael Wolfe,et al.  Beyond induction variables: detecting and classifying sequences using a demand-driven SSA form , 1995, TOPL.

[44]  Masaru Takesue,et al.  Software queue-based algorithms for pipelined synchronization on multiprocessors , 2003, 2003 International Conference on Parallel Processing Workshops, 2003. Proceedings..

[45]  James R. Goodman,et al.  Inferential Queueing and Speculative Push , 2003, ICS '03.

[46]  Miodrag Potkonjak,et al.  MediaBench: a tool for evaluating and synthesizing multimedia and communications systems , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[47]  Kunle Olukotun,et al.  The Stanford Hydra CMP , 2000, IEEE Micro.

[48]  Anoop Gupta,et al.  The Stanford Dash multiprocessor , 1992, Computer.

[49]  Roy Dz-Ching Ju,et al.  Characterization of Repeating Data Access Patterns in Integer Benchmarks , 2001 .

[50]  David I. August,et al.  Decoupled software pipelining with the synchronization array , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[51]  Jian Huang,et al.  The Superthreaded Processor Architecture , 1999, IEEE Trans. Computers.

[52]  Richard E. Kessler,et al.  The Alpha 21264 microprocessor , 1999, IEEE Micro.

[53]  Wen-mei W. Hwu,et al.  "Flea-flicker" multipass pipelining: an alternative to the high-power out-of-order offense , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[54]  G. Sohi,et al.  Effective jump-pointer prefetching for linked data structures , 1999, Proceedings of the 26th International Symposium on Computer Architecture (Cat. No.99CB36367).

[55]  Gurindar S. Sohi,et al.  Master/Slave Speculative Parallelization , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[56]  Sarita V. Adve,et al.  An evaluation of fine-grain producer-initiated communication in cache-coherent multiprocessors , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[57]  David Alejandro Padua Haiek Multiprocessors: discussion of some theoretical and practical problems , 1980 .

[58]  Anoop Gupta,et al.  Integration of message passing and shared memory in the Stanford FLASH multiprocessor , 1994, ASPLOS VI.

[59]  Roland E. Wunderlich,et al.  SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[60]  Easwaran Raman,et al.  A framework for unrestricted whole-program optimization , 2006, PLDI '06.

[61]  David K. Poulsen Memory latency reduction via data prefetching and data forwarding in shared memory multiprocessors , 1994 .

[62]  Mary K. Vernon,et al.  A Hybrid Shared Memory/Message Passing Parallel Machine , 1993, 1993 International Conference on Parallel Processing - ICPP'93.

[63]  William Thies,et al.  StreamIt: A Language for Streaming Applications , 2002, CC.

[64]  Thomas F. Wenisch,et al.  TurboSMARTS: accurate microarchitecture simulation sampling in minutes , 2005, SIGMETRICS '05.

[65]  Guilherme Ottoni,et al.  Automatic thread extraction with decoupled software pipelining , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[66]  S. Vajapeyam,et al.  Improving Superscalar Instruction Dispatch And Issue By Exploiting Dynamic Code Sequences , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[67]  Eric Rotenberg,et al.  A study of slipstream processors , 2000, MICRO 33.

[68]  Antonia Zhai,et al.  The STAMPede approach to thread-level speculation , 2005, TOCS.