An instruction set and microarchitecture for instruction level distributed processing

An instruction set architecture (ISA) suitable for future microprocessor design constraints is proposed. The ISA has hierarchical register files with a small number of accumulators at the top. The instruction stream is divided into chains of dependent instructions (strands) where intra-strand dependences are passed through the accumulator. The general-purpose register file is used for communication between strands and for holding global values that have many consumers. A microarchitecture to support the proposed ISA is proposed and evaluated. The microarchitecture consists of multiple, distributed processing elements. Each PE contains an instruction issue FIFO, a local register (accumulator) and local copy! of register file. The overall simplicity, hierarchical value communication, and distributed implementation will provide a very high clock speed and a relatively short pipeline while maintaining a form of superscalar out-of-order execution. Detailed timing simulations using translated program traces show the proposed microarchitecture is tolerant of global wire latencies. Ignoring the significant clock frequency advantages, a microarchitecture that supports a 4-wide fetch/decode pipeline, 8 serial PEs, and a two-cycle inter-PE communication latency performs as well as a conventional 4-way out-of-order superscalar processor.

[1]  J.E. Smith,et al.  The Astronautics ZS-1 processor , 1988, Proceedings 1988 IEEE International Conference on Computer Design: VLSI.

[2]  H. B. Bakoglu,et al.  The IBM RISC System/6000 Processor: Hardware Overview , 1990, IBM J. Res. Dev..

[3]  Gurindar S. Sohi,et al.  The Expandable Split Window Paradigm for Exploiting Fine-grain Parallelism , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[4]  Gurindar S. Sohi,et al.  Register traffic analysis for streamlining inter-operation communication in fine-grain parallel processors , 1992, MICRO 1992.

[5]  Michael J. Flynn,et al.  Computer Architecture: Pipelined and Parallel Processor Design , 1995 .

[6]  Manoj Franklin,et al.  PEWs: a decentralized dynamic scheduler for ILP processing , 1996, Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing.

[7]  Doug Burger,et al.  Evaluating Future Microprocessors: the SimpleScalar Tool Set , 1996 .

[8]  S. Vajapeyam,et al.  Improving Superscalar Instruction Dispatch And Issue By Exploiting Dynamic Code Sequences , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[9]  James E. Smith,et al.  Complexity-Effective Superscalar Processors , 1997, ISCA.

[10]  Norman P. Jouppi,et al.  The multicluster architecture: reducing cycle time through partitioning , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[11]  Retrospective: multiscalar processors , 1998, ISCA '98.

[12]  Vivek Sarkar,et al.  Space-time scheduling of instruction-level parallelism on a raw machine , 1998, ASPLOS VIII.

[13]  Ramon Canal,et al.  A cost-effective clustered architecture , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).

[14]  Richard E. Kessler,et al.  The Alpha 21264 microprocessor , 1999, IEEE Micro.

[15]  A. Klaiber The Technology Behind Crusoe TM Processors Low-power x 86-Compatible Processors Implemented with Code Morphing , 2000 .

[16]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[17]  Vikas Agarwal,et al.  Clock rate versus IPC: the end of the road for conventional microarchitectures , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[18]  James E. Smith,et al.  Instruction Level Distributed Processing , 2000, HiPC.

[19]  John Paul Shen,et al.  Instruction path coprocessors , 2000, ISCA '00.

[20]  Michael Gschwind,et al.  Dynamic Binary Translation and Optimization , 2001, IEEE Trans. Computers.

[21]  R. Nagarajan,et al.  A design space evaluation of grid processor architectures , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[22]  Ken Mai,et al.  The future of wires , 2001, Proc. IEEE.