Generation, Optimization, and Evaluation of Multithreaded Code

The recent advent of multithreaded architectures holds much promise: exploiting intrathread locality and the latency tolerance of multithreaded synchronization can yield more efficient processor utilization and higher scalability. The challenge for a code generation scheme is to make effective use of the underlying hardware by generating large threads with a high degree of internal locality, without limiting program-level parallelism or increasing latency. Top-down code generation, where threads are created directly from the compiler's intermediate form, is effective at creating relatively large threads. However, because it has only a limited view of the code at any one time, the quality of the threads it generates is limited. These top-down generated threads can therefore be improved by global, bottom-up optimization techniques. In this paper, we introduce the Pebbles multithreaded model of computation and analyze a code generation scheme in which top-down code generation is combined with bottom-up optimizations. We evaluate the effectiveness of this scheme in terms of overall performance and specific thread characteristics such as size, length, instruction-level parallelism, number of inputs, and synchronization costs.
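To make the thread characteristics named above concrete, the following is a minimal sketch of how they might be computed for a single thread. It is not the measurement method of the paper: the `Thread` class, the `thread_metrics` function, and the metric definitions are all assumptions made for illustration. Here size is taken to be the instruction count, length the number of instructions on the longest dependence chain, instruction-level parallelism the ratio of size to length, and inputs the number of distinct values read from outside the thread.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Thread:
    """Hypothetical thread representation (an assumption, not the Pebbles format).

    deps maps each instruction name to the values it reads. Instructions
    must be inserted in topological order. A value that no instruction in
    the thread produces counts as an external input.
    """
    deps: dict[str, list[str]] = field(default_factory=dict)


def thread_metrics(thread: Thread) -> dict[str, float]:
    """Compute size, length, ILP, and input count under the assumed definitions."""
    produced = set(thread.deps)

    # External inputs: operands not produced by any instruction in the thread.
    inputs = {
        value
        for operands in thread.deps.values()
        for value in operands
        if value not in produced
    }

    # Depth of each instruction along its longest internal dependence chain.
    # Relies on the topological insertion order stated in the class docstring.
    depth: dict[str, int] = {}
    for instr, operands in thread.deps.items():
        internal = (depth[v] for v in operands if v in produced)
        depth[instr] = 1 + max(internal, default=0)

    size = len(thread.deps)
    length = max(depth.values(), default=0)
    ilp = size / length if length else 0.0
    return {"size": size, "length": length, "ilp": ilp, "inputs": len(inputs)}


# Example: a and b are independent, c joins them, d extends the chain.
example = Thread({"a": ["x", "y"], "b": ["x"], "c": ["a", "b"], "d": ["c", "z"]})
metrics = thread_metrics(example)
```

For the example thread, four instructions sit on a dependence chain of length three and read three external values (`x`, `y`, `z`). Under these assumed definitions, a bottom-up optimization that merges two threads would typically raise size while changing the input count, which is the kind of trade-off the evaluation examines.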
