Distributed filaments: efficient fine-grain parallelism on a cluster of workstations

A fine-grain parallel program is one in which processes are typically small, ranging from a few to a few hundred instructions. Fine-grain parallelism arises naturally in many situations, such as iterative grid computations, recursive fork/join programs, the bodies of parallel FOR loops, and the implicit parallelism in functional or dataflow languages. It is useful both for describing massively parallel computations and as a target for code generation by compilers. However, fine-grain parallelism has long been thought to be inefficient due to the overheads of process creation, context switching, and synchronization. This paper describes a software kernel, Distributed Filaments (DF), that implements fine-grain parallelism both portably and efficiently on a workstation cluster. DF runs on existing, off-the-shelf hardware and software, and it has a simple interface, so it is easy to use. DF achieves efficiency by using stateless threads on each node, overlapping communication and computation, employing a new reliable datagram communication protocol, and automatically balancing the work generated by fork/join computations.
