Tiled Multicore Processors

For the last few decades, Moore’s Law has provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, designed to exploit these massive quantities of on-chip resources in an efficient, scalable manner. A tiled multicore architecture combines each processor core with a switch to form a modular element called a tile; tiles are replicated across the chip to create multicores with any number of tiles. The Raw processor, a pioneering example of a tiled multicore processor, is examined in detail to explain the philosophy, design, and strengths of such architectures.

Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance. Central to this goal is Raw’s ability to exploit all forms of parallelism: instruction-level (ILP), data-level (DLP), thread-level (TLP), and stream parallelism. Raw approaches the challenge by provisioning abundant on-chip resources, including logic, wires, and pins, in a tiled arrangement and exposing them through a new ISA, so that software can harness these resources for parallel applications. Compared to a traditional superscalar processor, Raw performs within a factor of 2x on sequential applications with very little ILP, roughly 2x–9x better at higher degrees of ILP, and 10x–100x better when highly parallel applications are coded in a stream language or optimized by hand.
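To make the tile abstraction concrete, the sketch below models a small mesh of tiles, each pairing a core with a switch wired to its nearest neighbors. It is a conceptual illustration only, assuming a simple 2D mesh; the names used here (Tile, make_grid, hops_between) are hypothetical and do not reflect Raw’s actual ISA or network interfaces. It shows why wire lengths stay bounded and communication cost stays proportional to distance as more tiles are replicated.

```c
/* Conceptual sketch of a tiled multicore layout: each tile pairs a core
 * with a switch connected to its north/east/south/west neighbors.
 * Illustrative model only -- the types and functions here are hypothetical
 * and are not Raw's actual ISA or network interfaces. */
#include <stdio.h>
#include <stdlib.h>

#define ROWS 4
#define COLS 4

typedef struct Tile {
    int id;                        /* core identifier                     */
    struct Tile *n, *e, *s, *w;    /* switch links to neighboring tiles   */
} Tile;

/* Replicate the tile to build a ROWS x COLS mesh, wiring each switch
 * to its nearest neighbors (NULL at the chip edge). */
static Tile *make_grid(void) {
    Tile *g = calloc(ROWS * COLS, sizeof(Tile));
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++) {
            Tile *t = &g[r * COLS + c];
            t->id = r * COLS + c;
            t->n = (r > 0)        ? &g[(r - 1) * COLS + c] : NULL;
            t->s = (r < ROWS - 1) ? &g[(r + 1) * COLS + c] : NULL;
            t->w = (c > 0)        ? &g[r * COLS + (c - 1)] : NULL;
            t->e = (c < COLS - 1) ? &g[r * COLS + (c + 1)] : NULL;
        }
    }
    return g;
}

/* Communication cost grows with Manhattan distance on the mesh, so each
 * hop crosses only one tile-length of wire regardless of chip size. */
static int hops_between(int src, int dst) {
    int dr = abs(src / COLS - dst / COLS);
    int dc = abs(src % COLS - dst % COLS);
    return dr + dc;
}

int main(void) {
    Tile *grid = make_grid();
    printf("tile 0 -> tile %d: %d hops\n",
           ROWS * COLS - 1, hops_between(0, ROWS * COLS - 1));
    free(grid);
    return 0;
}
```

The design point the sketch captures is that adding tiles never lengthens individual wires; software (or the compiler) sees the extra hops explicitly and can place communicating computations on nearby tiles.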
