APRIL: a processor architecture for multiprocessing

Processors in large-scale multiprocessors must be able to tolerate large communication latencies and synchronization delays. This paper describes the architecture of a rapid-context-switching processor called APRIL with support for fine-grain threads and synchronization. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial RISC-based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles are described. Measurements taken for several parallel applications on an APRIL simulator show that the overhead for supporting parallel tasks based on futures is reduced by a factor of two relative to a corresponding implementation on the Encore Multimax. The scalability of a multiprocessor based on APRIL is explored using a performance model. We show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.
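To make the relationship between resident threads, context-switch cost, and network latency concrete, the following is a minimal sketch of a first-order utilization model for a multithreaded processor. It is not the performance model used in the paper, which is not reproduced in this abstract; it ignores cache interference and network contention. The function name `utilization` and the `run_length` parameter (mean cycles of useful work between long-latency events) are assumptions introduced for illustration, and the run length used below is a hypothetical value, not a measurement from the paper.

```python
def utilization(threads, run_length, switch_cost, latency):
    """First-order processor utilization estimate (illustrative only).

    threads     -- number of resident threads per processor
    run_length  -- mean useful cycles between long-latency events (assumed)
    switch_cost -- cycles spent on each context switch
    latency     -- cycles a thread waits for a remote request
    """
    if threads < 1 or run_length <= 0 or switch_cost < 0 or latency < 0:
        raise ValueError("invalid model parameters")

    # Unsaturated regime: too few threads to cover the latency, so the
    # processor idles part of the time and utilization grows linearly.
    unsaturated = threads * run_length / (run_length + switch_cost + latency)

    # Saturated regime: another thread is always ready, so the only
    # overhead left is the context-switch cost itself.
    saturated = run_length / (run_length + switch_cost)

    return min(unsaturated, saturated)


if __name__ == "__main__":
    # 10-cycle switch and 55-cycle latency come from the abstract;
    # the 20-cycle run length is a hypothetical placeholder.
    for p in range(1, 5):
        print(p, round(utilization(p, 20, 10, 55), 3))
```

In this simplified model, adding threads helps only until the latency is fully overlapped; beyond that point utilization is bounded by the ratio of run length to run length plus switch cost, which is why a fast context switch matters.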
