Exploring the benefits of multiple hardware contexts in a multiprocessor architecture: preliminary results

A fundamental problem that any scalable multiprocessor must address is tolerating high-latency memory operations. This paper explores the extent to which multiple hardware contexts per processor can mitigate the negative effects of high latency. In particular, we evaluate the performance of a directory-based, cache-coherent multiprocessor using memory reference traces obtained from three parallel applications. We explore the case where there is a small, fixed number (2-4) of hardware contexts per processor and the context switch overhead is low. In contrast to previously proposed approaches, we also use a very simple context switch criterion, namely a cache miss or a write-hit to shared data. Our results show that the effectiveness of multiple contexts depends on the nature of the applications, the context switch overhead, and the inherent latency of the machine architecture. Given reasonably low-overhead hardware context switches, we show that two or four contexts can achieve substantial performance gains over a single context. For one application, the processor utilization increased by about 46% with two contexts and by about 80% with four contexts.
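To make the three dependencies named above concrete (application behavior, context switch overhead, and machine latency), the following is a minimal sketch of a simplified deterministic utilization model. It is not the trace-driven evaluation used in the paper. The function name `utilization`, its parameters, and the numeric values in the usage loop are all illustrative assumptions: each context is assumed to run for a fixed `run_length` cycles between switch events, each switch event to cost `latency` cycles to service, and each context switch to cost `switch_cost` cycles.

```python
def utilization(contexts: int, run_length: float,
                latency: float, switch_cost: float) -> float:
    """Estimate processor utilization under a simplified deterministic model.

    Assumptions (not taken from the paper): every context executes exactly
    `run_length` useful cycles, then stalls for `latency` cycles, and every
    context switch costs `switch_cost` cycles.
    """
    if contexts < 1:
        raise ValueError("contexts must be at least 1")
    if run_length <= 0 or latency < 0 or switch_cost < 0:
        raise ValueError("run_length must be positive; costs must be non-negative")

    # Linear regime: too few contexts to hide the whole latency, so useful
    # work grows in proportion to the number of contexts.
    linear = contexts * run_length / (run_length + switch_cost + latency)

    # Saturated regime: latency is fully hidden and only the switch
    # overhead limits utilization.
    saturated = run_length / (run_length + switch_cost)

    return min(linear, saturated)


if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    for n in (1, 2, 4, 8):
        u = utilization(n, run_length=20, latency=100, switch_cost=4)
        print(f"{n} context(s): utilization = {u:.3f}")
```

In this model, adding contexts helps only until the latency is fully hidden. Beyond that point the switch overhead alone caps utilization, which is one way to see why low-overhead switches matter when only two to four contexts are available.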
