Processor coupling: integrating compile time and runtime scheduling for parallelism

The technology to implement a single-chip node composed of 4 high-performance floating-point ALUs will be available by 1995. This paper presents processor coupling, a mechanism for controlling multiple ALUs to exploit both instruction-level and inter-thread parallelism using compile-time and runtime scheduling. The compiler statically schedules individual threads to discover the available intra-thread instruction-level parallelism. The runtime scheduling mechanism interleaves threads, exploiting inter-thread parallelism to maintain high ALU utilization. ALUs are assigned to threads on a cycle-by-cycle basis, and several threads can be active concurrently. We present simulation results demonstrating that, on four simple numerical benchmarks, processor coupling achieves better performance than purely statically scheduled or multiprocessor machine organizations. We examine how performance is affected by restricted communication between ALUs and by long memory latencies. We also present an implementation and feasibility study of a processor-coupled node.
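The core runtime idea — several active threads sharing a pool of ALUs, with assignments made each cycle — can be illustrated with a toy scheduler. This is only a sketch, not the paper's mechanism: it ignores data dependences, latencies, and the statically scheduled wide instructions, and simply interleaves ready operations from each thread's queue onto whatever ALUs are free in a cycle. All names (`couple_schedule`, the instruction labels) are hypothetical.

```python
from collections import deque

def couple_schedule(threads, num_alus):
    """Toy cycle-by-cycle scheduler: each cycle, fill up to
    `num_alus` issue slots by drawing operations round-robin
    from the still-active threads. Returns the per-cycle trace."""
    queues = [deque(t) for t in threads]
    trace = []
    while any(queues):
        issued = []
        progress = True
        # Keep sweeping the threads until the ALUs are full or
        # no thread has anything left to issue this cycle.
        while len(issued) < num_alus and progress:
            progress = False
            for q in queues:
                if q and len(issued) < num_alus:
                    issued.append(q.popleft())
                    progress = True
        trace.append(issued)
    return trace

# Two threads, each with 4 operations, sharing 4 ALUs: neither
# thread alone fills the machine, but interleaving them keeps
# all four ALUs busy, finishing in 2 cycles instead of 4.
t0 = ["a0", "a1", "a2", "a3"]
t1 = ["b0", "b1", "b2", "b3"]
print(couple_schedule([t0, t1], num_alus=4))
```

The point of the sketch is the utilization argument from the abstract: when one thread's instruction-level parallelism is too low to occupy all ALUs, operations from a second active thread fill the idle slots in the same cycle.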