Rigel: an architecture and scalable programming interface for a 1000-core accelerator

This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel's low-level programming interface adopts a single global address space model in which parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and/or restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications. We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm² in 45 nm, which is comparable to high-end GPUs scaled to 45 nm. We perform experimental analysis on several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be implemented without specialized hardware using flexible hardware primitives.
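As a rough illustration of the task-centric, bulk-synchronized style described above, the following Python sketch has a fixed pool of workers drain a shared task queue and then meet at a barrier that ends the interval. It is not the Rigel programming interface: the names `run_interval` and `square` and the parameter `num_workers` are hypothetical, and ordinary host threads stand in for accelerator cores.

```python
import queue
import threading


def run_interval(tasks, num_workers=4):
    """Run one bulk-synchronous interval (illustrative sketch only).

    tasks is a list of (function, argument) pairs. Workers dequeue tasks
    dynamically until the queue is empty, then wait at a barrier.
    """
    work = queue.Queue()
    for task in tasks:
        work.put(task)

    results = []
    results_lock = threading.Lock()
    barrier = threading.Barrier(num_workers)

    def worker():
        while True:
            try:
                fn, arg = work.get_nowait()  # dynamic task distribution
            except queue.Empty:
                break
            value = fn(arg)
            with results_lock:
                results.append(value)
        barrier.wait()  # every worker reaches the barrier before the interval ends

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def square(x):
    return x * x


# Completion order is not deterministic, so only the reduction is compared.
total = sum(run_interval([(square, i) for i in range(100)]))
```

Because every worker runs the same loop and pulls work on demand, load balancing falls out of the shared queue; the barrier is the only global synchronization point in the interval.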
