Multithreaded Systems

ion. In either case, a miss in local memory requires a request to be issued to the remote memory and a reply to be sent back to the requesting processor. Stalls due to this round-trip communication latency are, and will continue to be, an aggravating factor that limits the performance of scalable DSM systems.

Memory latency, while growing, is not a new phenomenon, and there have been varied efforts to resolve it. The most obvious approach is to reduce the physical latencies in the system. This involves making the pathway between the processor requesting the data and the remote memory that contains the data as efficient as possible, e.g., by reducing the software overhead of sending and receiving messages and by improving the connectivity of networks. The second approach is to reduce the frequency of long-latency operations by keeping data local to the processor that needs it. When data locality cannot be exploited, prefetching or block transfers of data (as opposed to cache-line transfers) can be used.

Caches are the most prevalent solution to the problem of memory latency. Unfortunately, they do not perform well if an application’s memory access patterns do not conform to their hard-wired policies. Furthermore, increasing cache capacities consumes an increasingly large silicon area on processor chips while yielding only diminishing returns.

Although the aforementioned approaches reduce latency, they do not eliminate it. Multithreading has emerged as a promising and exciting avenue for tolerating the latency that cannot be eliminated. A multithreaded system contains multiple “loci of control” (or threads) within a single program; the processor is shared among these threads, leading to higher utilization.
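The utilization gain from sharing a processor among threads can be illustrated with a small analytic sketch. The model below is a common textbook simplification, not taken from this text: each thread is assumed to run for a fixed number of cycles before a long-latency miss, every miss costs a fixed latency, and every switch costs a fixed overhead. The function name `utilization` and its parameters `n_threads`, `run_length`, `latency`, and `switch_cost` are hypothetical names chosen for illustration.

```python
def utilization(n_threads, run_length, latency, switch_cost=0.0):
    """Estimate processor utilization under an assumed deterministic model.

    Assumptions (illustrative only):
      - every thread runs `run_length` cycles, then misses;
      - every miss takes `latency` cycles to resolve;
      - every context switch costs `switch_cost` cycles.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    if run_length <= 0 or latency < 0 or switch_cost < 0:
        raise ValueError("run_length must be positive; costs must be non-negative")

    # Linear region: too few threads to cover the latency, so the
    # processor idles for part of each miss. Useful work grows with
    # the number of threads.
    linear = n_threads * run_length / (run_length + switch_cost + latency)

    # Saturation region: the latency is fully hidden, and the only
    # remaining loss is the context-switch overhead itself.
    saturated = run_length / (run_length + switch_cost)

    return min(linear, saturated)


if __name__ == "__main__":
    # Hypothetical numbers: 10 useful cycles per miss, 90-cycle remote
    # latency, 2-cycle context switch.
    for n in (1, 2, 4, 8, 16):
        print(n, round(utilization(n, run_length=10, latency=90, switch_cost=2), 3))
```

Under these assumptions, utilization rises linearly with the number of threads until roughly `1 + latency / (run_length + switch_cost)` threads are available, after which additional threads bring no further gain and the switch cost becomes the limiting factor. This is one reason fast context switching matters for hardware multithreading.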
The processor may switch between threads to hide not only memory latency but also other long-latency operations, such as I/O, or it may interleave instructions from multiple threads on a cycle-by-cycle basis to minimize pipeline breaks due to dependencies among instructions within a single thread.

Multithreading has also been used strictly as a programming paradigm on general-purpose hardware, to exploit thread parallelism on SMPs and to increase applications’ throughput and responsiveness. Lately, however, there has been increasing interest in providing hardware support for multithreading. Without adequate hardware support, such as multiple hardware contexts, fast context switching, non-blocking caches, out-of-order instruction issue and completion, and register renaming, we will not be able to take full advantage of the multithreading model of computation. As the feature size of logic devices shrinks, we feel that the silicon area can be put to better use by providing support for multithreading.

The idea of multithreading is not new. Fine-grained multithreading was implicit in the dataflow model of computation [34]. Multiple hardware contexts (i.e., register files, PSWs) to speed up switching between threads were implemented in systems such as Dorado [38], HEP [42], and Tera [4]. Some of these systems were not successful due to a lack of innovations in programming languages, run-time systems, and operating system kernels. There is, however, renewed interest in multithreading, primarily due to a confluence of several independent research directions that have united over a common set of issues and techniques. A number of research projects are underway for designing multithreaded systems

[1]  Emin Gün Sirer,et al.  SPIN: an extensible microkernel for application-specific operating system services , 1994, EW 6.

[2]  Allan Porterfield,et al.  The Tera computer system , 1990, ICS '90.

[3]  Anoop Gupta,et al.  Integration of message passing and shared memory in the Stanford FLASH multiprocessor , 1994, ASPLOS VI.

[4]  Mitsuhisa Sato,et al.  The EM-X parallel computer: architecture and basic performance , 1995, ISCA.

[5]  Dawson R. Engler,et al.  Exokernel: an operating system architecture for application-level resource management , 1995, SOSP.

[6]  Fong Pong,et al.  Missing the Memory Wall: The Case for Processor/Memory Integration , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[7]  Seth Copen Goldstein,et al.  TAM - A Compiler Controlled Threaded Abstract Machine , 1993, J. Parallel Distributed Comput..

[8]  Larry Rudolph,et al.  Message passing support on StarT-Voyager , 1998, Proceedings. Fifth International Conference on High Performance Computing (Cat. No. 98EX238).

[9]  Devang Shah,et al.  Implementing Lightweight Threads , 1992, USENIX Summer.

[10]  William J. Dally,et al.  The M-machine multicomputer , 1997, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[11]  Susan J. Eggers,et al.  The effectiveness of multiple hardware contexts , 1994, ASPLOS VI.

[12]  Anant Agarwal,et al.  LimitLESS directories: A scalable cache coherence scheme , 1991, ASPLOS IV.

[13]  Brian N. Bershad,et al.  Scheduler activations: effective kernel support for the user-level management of parallelism , 1991, TOCS.

[14]  Kenneth A. Pier.  A retrospective on the Dorado, a high-performance personal computer , 1983, ISCA '83.

[15]  Arvind,et al.  T: A Multithreaded Massively Parallel Architecture , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[16]  Bradley C. Kuszmaul,et al.  Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.

[17]  Anoop Gupta,et al.  The Stanford FLASH multiprocessor , 1994, ISCA '94.

[18]  Donald Yeung,et al.  The MIT Alewife machine: architecture and performance , 1995, ISCA '98.

[19]  Jean-Loup Baer,et al.  An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[20]  Joseph Boykin,et al.  Programming Under Mach , 1993 .

[21]  Norman P. Jouppi,et al.  How useful are non-blocking loads, stream buffers and speculative execution in multiple issue processors? , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[22]  Jack L. Lo,et al.  Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[23]  Anant Agarwal,et al.  Performance Tradeoffs in Multithreaded Processors , 1992, IEEE Trans. Parallel Distributed Syst..

[24]  G. Andrew Boughton.  Arctic Routing Chip , 1994, PCRCW.

[25]  Robert H. Halstead,et al.  MULTILISP: a language for concurrent symbolic computation , 1985, TOPL.

[26]  Donald Yeung,et al.  Sparcle: an evolutionary processor design for large-scale multiprocessors , 1993, IEEE Micro.

[27]  Michael S. Ehrlich,et al.  StarT-jr : a parallel system from commodity technology , 1997 .

[28]  Mitsuhisa Sato,et al.  Dynamic Characteristics of Multithreaded Execution in the EM-X Multiprocessor , 1995 .

[29]  Pierluigi Civera,et al.  Multiprocessor System Architecture , 1985 .

[30]  Scott Oaks,et al.  Java Threads , 1997 .

[31]  Devang Shah,et al.  Programming with threads , 1996 .

[32]  Michel Dubois,et al.  Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors , 2006, International Conference on Parallel Processing.

[33]  James C. Hoe,et al.  START-NG: Delivering Seamless Parallel Computing , 1995, Euro-Par.

[34]  Rishiyur S. Nikhil,et al.  Cid: A Parallel, "Shared-Memory" C for Distributed-Memory Machines , 1994, LCPC.

[35]  Susan J. Eggers,et al.  Impact of sharing-based thread placement on multithreaded architectures , 1994, ISCA '94.

[36]  Edward D. Lazowska,et al.  User-Level Threads and Interprocess Communication , 1993 .

[37]  Kevin B. Theobald,et al.  Panel Sessions of The 1991 Workshop on Multithreaded Computers , 1993 .

[38]  Brian N. Bershad,et al.  Lightweight remote procedure call , 1989, TOCS.

[39]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[40]  Ali R. Hurson,et al.  Dataflow architectures and multithreading , 1994, Computer.

[41]  Mark Smith,et al.  Beyond Multiprocessing: Multithreading the SunOS Kernel , 1992, USENIX Summer.