Polymorphous Computing Architectures

Abstract: We describe the architecture and hardware implementation of a coarse-grained parallel computing system with flexibility in both memory and processing elements. The memory subsystem efficiently supports a wide range of programming models, including cache coherence, message passing, streaming, and transactions. The memory controller implements these models using metadata stored with each memory block. Processor flexibility is provided by Tensilica Xtensa cores. We use Xtensa processor options and the Tensilica Instruction Extension (TIE) language to provide additional computational capabilities, to define the additional memory operations needed to support our controller, and to add VLIW instructions for increased efficiency. In our implementation, two processors share multiple memory blocks via a load/store unit and a crossbar switch. These dual-processor tiles are grouped into quads that share a memory protocol controller. Quads connect to one another and to the off-chip memory controller via a mesh-like network. We describe the design of each block in detail. We also describe our implementation of transactional memory. Transactional Coherence and Consistency (TCC) provides greater scalability than previous transactional memory (TM) architectures by deferring conflict detection until commit time and by using directories to reduce overhead. We demonstrate near-linear scaling up to 64 processors with less than 5% overhead.
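The idea of deferring conflict detection until commit time can be illustrated with a small software model. The sketch below is not the hardware design described in the abstract: the names `Transaction`, `commit`, `read_set`, and `write_buffer` are hypothetical, the directory mechanism is not modeled, and memory is a plain Python dictionary. It only shows the general pattern in which writes are buffered locally and a committing transaction invalidates any other active transaction that has read an address it wrote.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """Speculative state buffered by one processor (hypothetical model)."""
    name: str
    read_set: set = field(default_factory=set)
    write_buffer: dict = field(default_factory=dict)
    aborted: bool = False

    def read(self, memory, addr):
        # Reads see the transaction's own buffered writes first.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        self.read_set.add(addr)
        return memory.get(addr, 0)

    def write(self, addr, value):
        # Writes stay local until commit; no conflict check happens here.
        self.write_buffer[addr] = value


def commit(txn, memory, active):
    """Publish txn's writes, then abort active transactions that read them."""
    if txn.aborted:
        return False
    memory.update(txn.write_buffer)
    written = set(txn.write_buffer)
    for other in active:
        if other is not txn and not other.aborted and other.read_set & written:
            other.aborted = True  # conflict detected only now, at commit time
    return True


memory = {"x": 1, "y": 2}
t1 = Transaction("t1")
t2 = Transaction("t2")
t1.write("x", t1.read(memory, "x") + 10)  # t1 reads x, buffers x = 11
t2.write("y", t2.read(memory, "x") + 1)   # t2 reads x, buffers y = 2
ok1 = commit(t1, memory, [t1, t2])        # t1 commits first and succeeds
ok2 = commit(t2, memory, [t1, t2])        # t2 read a stale x, so it must retry
```

In this model neither transaction pays for conflict checks while it runs; the only check happens when `t1` commits, at which point `t2` is marked aborted because its read set overlaps the addresses `t1` wrote.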
