Tagged Dataflow: a Formal Model for Iterative Map-Reduce

In this paper, we consider the recent iterative extensions of the Map-Reduce framework and we argue that they would greatly benefit from the research work that was conducted in the area of dataflow computing more than thirty years ago. In particular, we suggest that the tagged-dataflow model of computation can be used as the formal framework behind existing and future iterative generalizations of Map-Reduce. Moreover, we present various applications in which the tagged model gives elegant solutions with increased parallelism. The tagged-dataflow approach for iterative Map-Reduce creates a number of interesting research challenges which deserve further investigation, such as the requirement for a more sophisticated fault tolerance model.
