PL/PS: A Non-Blocking Multithreaded Architecture With Decoupled Memory And Pipelines

In this paper we propose a new approach to building multithreaded uniprocessors that can serve as building blocks in high-end computing architectures. The novelty of our approach lies in a multithreaded architecture with non-blocking threads in which all memory accesses are decoupled from thread execution. Data is pre-loaded into the thread context (registers), and all results are post-stored after the thread completes execution. Decoupling memory accesses from thread execution requires a separate unit to perform the necessary pre-loads and post-stores and to control the allocation of hardware thread contexts to enabled threads. This separation helps achieve high locality and minimizes the impact of distribution and hierarchy in large memory systems. Because threads are non-blocking, no thread switching is needed during execution, which reduces thread scheduling overhead. We present preliminary results from a Monte Carlo simulator that compares the performance of the proposed system with that of conventional architectures for randomly generated threads.
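
The pre-load / execute / post-store discipline described above can be illustrated with a minimal software sketch. This is not the proposed hardware or the authors' simulator; the names `Thread`, `memory_unit_preload`, `execution_unit_run`, `memory_unit_poststore`, and `run_thread` are hypothetical, and memory is modeled as a plain dictionary purely for illustration. The point it tries to show is that the thread body touches registers only, so it never waits on memory once it has started.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Thread:
    """Hypothetical model of a non-blocking thread (illustrative only)."""
    preloads: Dict[str, int]                 # register name -> address to load from
    body: Callable[[Dict[str, int]], None]   # computation that uses registers only
    poststores: Dict[str, int]               # register name -> address to store to
    registers: Dict[str, int] = field(default_factory=dict)


def memory_unit_preload(thread: Thread, memory: Dict[int, int]) -> None:
    # The separate memory unit fills the thread context before execution starts.
    for reg, addr in thread.preloads.items():
        thread.registers[reg] = memory[addr]


def execution_unit_run(thread: Thread) -> None:
    # The body sees registers only, so it cannot block on a memory access.
    thread.body(thread.registers)


def memory_unit_poststore(thread: Thread, memory: Dict[int, int]) -> None:
    # Results are written back only after the thread has finished executing.
    for reg, addr in thread.poststores.items():
        memory[addr] = thread.registers[reg]


def run_thread(thread: Thread, memory: Dict[int, int]) -> None:
    memory_unit_preload(thread, memory)
    execution_unit_run(thread)
    memory_unit_poststore(thread, memory)


def add_body(regs: Dict[str, int]) -> None:
    # Example thread body: r2 = r0 + r1, with no memory references.
    regs["r2"] = regs["r0"] + regs["r1"]


memory = {0: 3, 1: 4, 2: 0}
t = Thread(preloads={"r0": 0, "r1": 1}, body=add_body, poststores={"r2": 2})
run_thread(t, memory)
```

In this sketch the three phases run back to back for a single thread. In the architecture described above, the memory unit would presumably overlap the pre-loads and post-stores of some threads with the execution of others, which the sketch does not attempt to model.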