Architectural and implementation tradeoffs in the design of multiple-context processors

Multiple-context processors have been proposed as an architectural technique to mitigate the effects of large memory latency in multiprocessors. We examine two schemes for implementing multiple-context processors. The first scheme switches between contexts only on a cache miss, while the other interleaves the contexts on a cycle-by-cycle basis. Both schemes provide the capability for a single context to fully utilize the pipeline. We show that cycle-by-cycle interleaving of contexts provides a performance advantage over switching contexts only at a cache miss. This advantage results from the context interleaving hiding pipeline dependencies and reducing the context switch cost. In addition, we show that while the implementation of the interleaved scheme is more complex, the complexity is not overwhelming. As pipelines get deeper and operate at lower percentages of peak performance, the performance advantage of the interleaved scheme is likely to justify its additional complexity. Kev Words and Phrases: multiple-context processors, multithreading, latency hiding, multiprocessors, pipelining

[1]  David Kroft,et al.  Lockup-free instruction fetch/prefetch cache organization , 1998, ISCA '81.

[2]  Anant Agarwal,et al.  APRIL: a processor architecture for multiprocessing , 1990, ISCA '90.

[3]  Helen Davis,et al.  Tango introduction and tutorial , 1990 .

[4]  James H. Patterson,et al.  Portable Programs for Parallel Processors , 1987 .

[5]  Anoop Gupta,et al.  The directory-based cache coherence protocol for the DASH multiprocessor , 1990, ISCA '90.

[6]  Donald Yeung,et al.  THE MIT ALEWIFE MACHINE: A LARGE-SCALE DISTRIBUTED-MEMORY MULTIPROCESSOR , 1991 .

[7]  Andrew R. Pleszkun,et al.  Strategies for achieving improved processor throughput , 1991, ISCA '91.

[8]  A. Gupta,et al.  Exploring the benefits of multiple hardware contexts in a multiprocessor architecture: preliminary results , 1989, ISCA '89.

[9]  Anoop Gupta,et al.  SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.

[10]  Michel Dubois,et al.  Lockup-free Caches in High-Performance Multiprocessors , 1990, J. Parallel Distributed Comput..

[11]  Kevin P. McAuliffe,et al.  The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture , 1985, ICPP.

[12]  David Chaiken,et al.  Latency Tolerance through Multithreading in Large-Scale Multiprocessors , 1991 .

[13]  Chuan-lin Wu,et al.  A Benchmark Evaluation of a Multi-threaded RISC Processor Architecture , 1991, ICPP.

[14]  Anoop Gupta,et al.  Comparative evaluation of latency reducing and tolerating techniques , 1991, ISCA '91.

[15]  David E. Culler,et al.  Analysis of multithreaded architectures for parallel computing , 1990, SPAA '90.

[16]  Hwa C. Torng,et al.  The Concurrent Execution of Multiple Instruction Streams on Superscalar Processors , 1991, ICPP.

[17]  Burton J. Smith Architecture And Applications Of The HEP Multiprocessor Computer System , 1982, Optics & Photonics.

[18]  Robert H. Halstead,et al.  MASA: a multithreaded processor architecture for parallel symbolic computing , 1988, [1988] The 15th Annual International Symposium on Computer Architecture. Conference Proceedings.

[19]  Gerry Kane,et al.  MIPS RISC Architecture , 1987 .

[20]  Alan Jay Smith,et al.  Branch Prediction Strategies and Branch Target Buffer Design , 1995, Computer.

[21]  Lars Lundberg,et al.  A Lockup-Free Multiprocessor Cache Design , 1991, ICPP.