Composing and executing parallel data-flow graphs with shell pipes

This paper extends the concept of shell pipes to incorporate forks, joins, cycles, and key-value aggregation. These extensions enable a class of data-flow computations with strong determinism properties, and they provide a simple yet powerful coordination layer for combining multi-language and legacy components in large-scale parallel computation. Concretely, the paper describes the design and implementation of these language extensions in the Bourne Again SHell (BASH) and evaluates the system's performance with micro- and macro-benchmarks. The implemented system is shown to scale to thousands of processors, sustaining high throughput for millions of processing tasks on large commodity compute clusters.
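To make the terms fork, join, and key-value aggregation concrete, the following is a minimal sketch that uses only standard Bash features (named pipes, `tee`, `awk`). It is not the extended syntax this paper introduces, which is not shown here. The function names `fork_join` and `sum_by_key` are hypothetical, as is the choice of `tr` as the per-branch work.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: plain Bash, not the paper's language extensions.

# fork_join (hypothetical name): fork stdin into two branches, then join them.
fork_join() {
  local tmp p1 p2
  tmp=$(mktemp -d)
  mkfifo "$tmp/upper" "$tmp/lower"

  # Branches: each reads its own FIFO and runs concurrently with the other.
  tr 'a-z' 'A-Z' < "$tmp/upper" > "$tmp/upper.out" &
  p1=$!
  tr 'A-Z' 'a-z' < "$tmp/lower" > "$tmp/lower.out" &
  p2=$!

  # Fork: tee copies every input line to both FIFOs. It runs in the
  # foreground so that it keeps the function's stdin.
  tee "$tmp/upper" > "$tmp/lower"

  # Join: wait for both branches, then emit results in a fixed order so the
  # output does not depend on how the branches were scheduled.
  wait "$p1" "$p2"
  cat "$tmp/upper.out" "$tmp/lower.out"
  rm -rf "$tmp"
}

# sum_by_key (hypothetical name): aggregate "key value" lines by key.
sum_by_key() {
  awk '{ s[$1] += $2 } END { for (k in s) print k, s[k] }' | LC_ALL=C sort
}

printf 'Hello\n' | fork_join               # prints HELLO, then hello
printf 'a 1\nb 2\na 3\n' | sum_by_key      # prints "a 4", then "b 2"
```

The join concatenates branch outputs in a fixed order and the aggregation sorts by key, so repeated runs on the same input produce the same output. This is only an informal illustration of the kind of determinism the abstract refers to.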
