Implicit Parallelism through Deep Language Embedding

The appeal of MapReduce has spawned a family of systems that implement or extend it. To enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tightly coupled to their underlying runtime engines. Expressing data analysis algorithms with complex data and control flow structure using such APIs reveals a number of limitations that impede programmer productivity. In this paper we show that designing data analysis languages and APIs from the point of view of the runtime engine bloats the APIs with low-level primitives and hurts programmer productivity. Instead, we argue that deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataflows and (ii) hides the notions of data parallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.
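The idea of deep embedding can be illustrated with a small sketch. The language proposed here is embedded in Scala; the Python snippet below is not that implementation and only mirrors the general principle: when a dataflow is written as an ordinary host-language comprehension, a compiler can inspect its syntax tree and recover the inputs and predicates, which a library API receiving opaque UDFs cannot do. The names `customers`, `orders`, and `describe` are hypothetical and chosen for illustration only.

```python
import ast  # ast.unparse requires Python 3.9 or newer

# Hypothetical dataflow, written as a plain host-language comprehension.
SOURCE = (
    "[(c.name, o.total) "
    "for c in customers "
    "for o in orders "
    "if c.id == o.customer_id]"
)


def describe(source: str) -> dict:
    """Extract the inputs and filter predicates of a list comprehension."""
    comp = ast.parse(source, mode="eval").body
    if not isinstance(comp, ast.ListComp):
        raise ValueError("expected a list comprehension")
    return {
        # One entry per `for ... in <input>` clause, in source order.
        "inputs": [ast.unparse(g.iter) for g in comp.generators],
        # Every `if` condition attached to any of the clauses.
        "predicates": [ast.unparse(p) for g in comp.generators for p in g.ifs],
    }


plan = describe(SOURCE)
print(plan["inputs"])
print(plan["predicates"])
```

Because the predicate `c.id == o.customer_id` is visible as syntax rather than hidden inside a compiled function, a compiler working at this level could in principle recognize the comprehension as a join and choose a parallel execution strategy, without the programmer calling a join primitive explicitly.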
