An evaluation of medium-grain dataflow code

In this paper, we study several issues related to the medium-grain dataflow model of execution. We present bottom-up compilation of medium-grain clusters from a fine-grain dataflow graph. We compare the basic block and the dependence sets algorithms, which partition dataflow graphs into clusters. For an extensive set of benchmarks, we assess the average number of instructions in a cluster and the reduction in matching operations compared with fine-grain dataflow execution. We study the performance of medium-grain dataflow as several architectural parameters, such as the number of processors, matching cost, and network latency, are varied. The results indicate that medium-grain execution offers a good speedup over the fine-grain model, that it is scalable, and that it tolerates network latency and high matching costs well. Medium-grain execution can also benefit from higher processor output bandwidth; finally, a simple superscalar processor with an issue rate of two is sufficient to exploit the internal parallelism of a cluster.
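To illustrate the kind of partitioning the abstract refers to, the following is a minimal sketch of basic-block-style clustering of a fine-grain dataflow graph: a node starts a new cluster unless it lies on a simple chain (exactly one predecessor, which has exactly one successor). This is an illustrative simplification under assumed graph representations (edge lists over node names), not the paper's actual compiler algorithm; the function name `basic_block_clusters` is hypothetical.

```python
from collections import defaultdict

def basic_block_clusters(edges, nodes):
    """Partition a fine-grain dataflow graph into straight-line clusters.

    A node leads a new cluster unless it has exactly one predecessor
    and that predecessor has exactly one successor (i.e., it extends
    a simple chain). Each cluster is a maximal such chain.
    """
    succs = defaultdict(list)
    preds = defaultdict(list)
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    clusters = []
    for n in nodes:
        # Cluster leaders: graph entry nodes, join points, or fan-out targets.
        if len(preds[n]) != 1 or len(succs[preds[n][0]]) != 1:
            cluster = [n]
            cur = n
            # Follow the chain while it remains a one-in/one-out sequence.
            while len(succs[cur]) == 1 and len(preds[succs[cur][0]]) == 1:
                cur = succs[cur][0]
                cluster.append(cur)
            clusters.append(cluster)
    return clusters

# A tiny graph: a fans out to b and d; b feeds c in a simple chain.
print(basic_block_clusters([('a', 'b'), ('b', 'c'), ('a', 'd')],
                           ['a', 'b', 'c', 'd']))
# → [['a'], ['b', 'c'], ['d']]
```

Larger clusters reduce token-matching operations, since tokens are matched once per cluster rather than once per fine-grain instruction, which is the effect the benchmarks in the paper quantify.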
