Scalar operand networks

The bypass paths and multiported register files in microprocessors serve as an implicit interconnect to communicate operand values among pipeline stages and multiple ALUs. Previous superscalar designs implemented this interconnect using centralized structures that do not scale with increasing ILP demands. In search of scalability, recent microprocessor designs in industry and academia exhibit a trend toward distributed resources such as partitioned register files, banked caches, multiple independent compute pipelines, and even multiple program counters. Some of these partitioned microprocessor designs have begun to implement bypassing and operand transport using point-to-point interconnects. We call interconnects optimized for scalar data transport, whether centralized or distributed, scalar operand networks. Although these networks share many of the challenges of multiprocessor networks such as scalability and deadlock avoidance, they have many unique requirements, including ultra-low latency (a few cycles versus tens of cycles) and ultra-fast operation-operand matching. This work discusses the unique properties of scalar operand networks (SONs), examines alternative ways of implementing them, and introduces the AsTrO taxonomy to distinguish between them. It discusses the design of two alternative networks in the context of the Raw microprocessor, and presents timing, area, and energy statistics for a real implementation. The paper also presents a 5-tuple performance model for SONs and analyzes their performance sensitivity to network properties for ILP workloads.
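As a rough illustration of how a 5-tuple performance model might be used, the sketch below treats the end-to-end cost of delivering one operand as a sum of five per-message costs. The abstract does not name the five components, so the field names here (send occupancy, send latency, per-hop network latency, receive latency, receive occupancy), the additive formula, and the example values are all assumptions made for illustration, not figures taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SonCost:
    """Hypothetical 5-tuple of per-operand costs, all in cycles.

    The component names are assumptions; the abstract only states
    that the model is a 5-tuple.
    """
    send_occupancy: int     # cycles the sender spends injecting the operand
    send_latency: int       # cycles before the operand enters the network
    hop_latency: int        # cycles per network hop
    receive_latency: int    # cycles from network exit to the consumer
    receive_occupancy: int  # cycles the receiver spends accepting the operand

    def operand_delay(self, hops: int) -> int:
        """Total cycles to move one operand across `hops` network hops."""
        if hops < 0:
            raise ValueError("hops must be non-negative")
        return (
            self.send_occupancy
            + self.send_latency
            + self.hop_latency * hops
            + self.receive_latency
            + self.receive_occupancy
        )


# Illustrative values only, not measurements from the paper.
example = SonCost(0, 1, 1, 1, 0)
print(example.operand_delay(hops=2))
```

Under this additive assumption, lowering any single component shortens every operand transfer, which is one way to reason about the sensitivity of ILP workloads to individual network properties.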
