Scalar operand networks: on-chip interconnect for ILP in partitioned architectures

The bypass paths and multiported register files in microprocessors serve as an implicit interconnect that communicates operand values among pipeline stages and multiple ALUs. Previous superscalar designs implemented this interconnect using centralized structures that do not scale with increasing ILP demands. In search of scalability, recent microprocessor designs in industry and academia exhibit a trend toward distributed resources such as partitioned register files, banked caches, multiple independent compute pipelines, and even multiple program counters. Some of these partitioned microprocessor designs have begun to implement bypassing and operand transport using point-to-point interconnects rather than centralized networks. We call interconnects optimized for scalar data transport, whether centralized or distributed, scalar operand networks. Although these networks share many of the challenges of multiprocessor networks, such as scalability and deadlock avoidance, they also have unique requirements, including ultra-low latencies (a few cycles rather than tens of cycles) and ultra-fast operation-operand matching. This paper discusses the unique properties of scalar operand networks, examines alternative ways of implementing them, and describes in detail the implementation of one such network in the Raw microprocessor. The paper analyzes the performance of these networks for ILP workloads and the sensitivity of overall ILP performance to network properties.
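To make the notion of operation-operand matching concrete, the following is a minimal sketch of a two-tile model in which a consuming instruction may issue only once its remote operand has crossed a point-to-point link. The tile names, the one-cycle hop latency, and the fire-on-arrival rule are illustrative assumptions for this sketch, not the Raw microprocessor's actual mechanism.

# Toy discrete-time sketch of operation-operand matching over a
# point-to-point scalar operand network. Assumptions: two tiles,
# a one-cycle hop latency, and issue triggered by operand arrival.
from collections import deque

HOP_LATENCY = 1  # assumed cycles for a value to cross the network


class Tile:
    """A compute tile with a queue of operands arriving off the network."""

    def __init__(self, name):
        self.name = name
        self.in_queue = deque()

    def receive(self, value):
        self.in_queue.append(value)


def simulate():
    a, b = Tile("A"), Tile("B")
    in_flight = []       # (arrival_cycle, destination_tile, value)
    local_operand = 5    # operand already resident in tile B

    for cycle in range(10):
        # Deliver any network messages whose hop latency has elapsed.
        for msg in list(in_flight):
            arrival, dest, value = msg
            if cycle >= arrival:
                dest.receive(value)
                in_flight.remove(msg)

        if cycle == 0:
            # Tile A produces a value and injects it into the network.
            produced = 7
            print(f"cycle {cycle}: tile {a.name} sends operand {produced}")
            in_flight.append((cycle + HOP_LATENCY, b, produced))

        # Tile B's add issues only once its remote operand has arrived:
        # this arrival-triggered issue is the operation-operand matching step.
        if b.in_queue:
            remote_operand = b.in_queue.popleft()
            print(f"cycle {cycle}: tile {b.name} issues add -> "
                  f"{local_operand + remote_operand}")
            return


simulate()

In this sketch the consumer observes the operand one cycle after the producer injects it, which is the kind of few-cycle, send/receive-style latency the abstract contrasts with the tens of cycles typical of multiprocessor message passing.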
