Tiled microprocessors

Current-day microprocessors have reached the point of diminishing returns due to inherent scalability limitations. This thesis examines the tiled microprocessor, a class of microprocessor that is physically scalable but inherits many of the desirable properties of conventional microprocessors. Tiled microprocessors are composed of an array of replicated tiles connected by a special class of network, the Scalar Operand Network (SON), which is optimized for low-latency, low-occupancy communication between remote ALUs on different tiles. Tiled microprocessors can be constructed to scale to hundreds or thousands of functional units. This thesis identifies seven key criteria for achieving physical scalability in tiled microprocessors. It employs an archetypal tiled microprocessor to examine the challenges in achieving these criteria and to explore the properties of Scalar Operand Networks. The thesis develops the field of SONs in three major ways: it introduces the 5-tuple performance metric, it describes a complete, high-frequency SON implementation, and it proposes a taxonomy, called AsTrO, for categorizing SONs. To develop these ideas, the thesis details the design, implementation, and analysis of a tiled microprocessor prototype, the Raw microprocessor, which was implemented at MIT in 180 nm technology. Overall, compared to Raw, recent commercial processors with half the transistors required 30x as many lines of code, occupied 100x as many designers, contained 50x as many pre-tapeout bugs, and resulted in 33x as many post-tapeout bugs. At the same time, the Raw microprocessor proves to be more versatile in exploiting ILP, stream, and server-farm workloads with modest to large amounts of parallelism. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
