A Non-Blocking Multithreaded Architecture

In this paper we present a new approach to building multithreaded uni-processors. The novelty of our approach lies in an architecture with non-blocking threads in which all memory accesses are decoupled from thread execution. Data is pre-loaded into the thread context (registers), and all results are "post-stored" after the thread completes execution. Decoupling memory accesses from thread execution requires a separate unit to perform the necessary pre-loads and post-stores and to control the allocation of hardware thread contexts to enabled threads. This separation facilitates high locality and minimizes the impact of distribution and hierarchy in large memory systems. We present preliminary results obtained from a Monte Carlo simulator that compares the performance of the proposed system with that of conventional architectures for randomly generated threads.
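The pre-load, execute, post-store sequence described above can be illustrated with a minimal software sketch. It is not the proposed hardware or the simulator used in the paper: the names `DecoupledThread` and `run_decoupled`, the flat list standing in for memory, and the dictionary standing in for a register context are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DecoupledThread:
    """Hypothetical thread descriptor (illustrative, not from the paper)."""
    loads: Dict[str, int]    # register name -> memory address to pre-load
    stores: Dict[str, int]   # register name -> memory address to post-store
    body: Callable[[Dict[str, int]], None]  # touches registers only


def run_decoupled(threads: List[DecoupledThread], memory: List[int]) -> None:
    """Run each thread as pre-load, non-blocking execute, post-store."""
    for t in threads:
        # Pre-load: a separate unit fills the thread context from memory.
        regs = {reg: memory[addr] for reg, addr in t.loads.items()}
        # Execute: the body never accesses memory, so it cannot block on it.
        t.body(regs)
        # Post-store: results are written back after the thread completes.
        for reg, addr in t.stores.items():
            memory[addr] = regs[reg]


def add_body(regs: Dict[str, int]) -> None:
    # Example thread body: r2 = r0 + r1, using registers only.
    regs["r2"] = regs["r0"] + regs["r1"]


memory = [3, 4, 0]
run_decoupled(
    [DecoupledThread(loads={"r0": 0, "r1": 1}, stores={"r2": 2}, body=add_body)],
    memory,
)
```

In this sketch the thread body sees only its register context, which is the property that lets a thread run to completion without stalling on memory; in the architecture described above, the pre-load and post-store steps would be carried out by the separate unit rather than inline.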