Support for High-Frequency Streaming in CMPs

As the industry moves toward larger-scale chip multiprocessors, the need to parallelize applications grows. High inter-thread communication delays, exacerbated by over-stressed high-latency memory subsystems and ever-increasing wire delays, require parallelization techniques to create partially or fully independent threads to improve performance. Unfortunately, developers and compilers alike often fail to find sufficient independent work of this kind. Recently proposed pipelined streaming techniques have shown significant promise for both manual and automatic parallelization. These techniques are widely applicable because they embrace inter-thread dependences (provided the dependences are acyclic) and tolerate long communication latencies for them. This paper addresses the lack of architectural support for this type of concurrency, which has blocked its adoption and hindered related language and compiler research. We observe that both manual and automatic techniques create high-frequency streaming threads, with communication occurring every 5 to 20 instructions. Although such threads easily tolerate inter-thread transit delays, high-frequency communication makes their performance very sensitive to intra-thread delays from the repeated execution of the communication operations. Using this observation, we define the design space and evaluate several mechanisms to find a better trade-off between performance and operating system, hardware, and design costs. From this, we find a lightweight streaming-aware enhancement to conventional memory subsystems that doubles the speed of these codes and is within 2% of the best-performing, but heavyweight, hardware solution.
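To make the pipelined streaming pattern concrete, the following is a minimal software sketch, not the mechanism proposed in the paper. It assumes a two-stage pipeline joined by a bounded queue from Python's standard library; the names `produce`, `consume`, `run_pipeline`, and `SENTINEL`, and the squaring workload, are hypothetical and chosen only for illustration. Data flows in one direction only (an acyclic dependence), and each loop iteration executes one communication operation, which is the per-iteration cost the abstract identifies as the bottleneck.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker


def produce(items, q):
    """Stage 1: walk the input and stream each element forward."""
    for item in items:
        q.put(item)  # communication operation, executed once per iteration
    q.put(SENTINEL)


def consume(q, results):
    """Stage 2: receive each element and do the per-element work."""
    total = 0
    while True:
        item = q.get()  # blocks until the producer has sent a value
        if item is SENTINEL:
            break
        total += item * item  # stand-in for real per-element work
    results.append(total)


def run_pipeline(items, capacity=32):
    """Run both stages concurrently, decoupled by a bounded queue."""
    q = queue.Queue(maxsize=capacity)
    results = []
    stage1 = threading.Thread(target=produce, args=(items, q))
    stage2 = threading.Thread(target=consume, args=(q, results))
    stage1.start()
    stage2.start()
    stage1.join()
    stage2.join()
    return results[0]
```

The queue lets the producer run ahead of the consumer, so the time a value spends in transit does not stall either stage. What each stage cannot hide is the cost of its own `put` or `get` call, which becomes significant when the work between calls is only a handful of instructions.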
