Implicit Parallelism through Deep Language Embedding

Parallel collection processing based on second-order functions such as map and reduce has been widely adopted for scalable data analysis. Initially popularized by Google, this programming paradigm has over the past decade found its way into the core APIs of parallel dataflow engines such as Hadoop's MapReduce, Spark's RDDs, and Flink's DataSets. We review programming patterns typical of these APIs and discuss how they relate to the underlying parallel execution model. We argue that fixing the abstraction leaks exposed by these patterns will reduce the cost of data analysis by improving programmer productivity. To that end, we first revisit the algebraic foundations of parallel collection processing. Building on these foundations, we propose a simplified API that (i) provides proper support for nested collection processing and (ii) alleviates the need for certain second-order primitives through comprehensions, a declarative syntax akin to SQL. Finally, we present a metaprogramming pipeline that performs algebraic rewrites and physical optimizations, allowing us to target parallel dataflow engines like Spark and Flink with competitive performance.
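To make the contrast between second-order primitives and comprehensions concrete, the following is a minimal Python sketch, not the API proposed in the paper. The toy data and the names `join_second_order`, `join_comprehension`, `fold`, and `parallel_style_fold` are hypothetical and chosen only for illustration. The sketch expresses the same join twice, once with explicit map and filter calls and once as a comprehension, and then shows why a fold defined by a zero element, a per-element function, and an associative, commutative combine function can be evaluated partition by partition.

```python
from functools import reduce

# Hypothetical toy data, assumed for illustration only.
users = [("ann", 34), ("bob", 27), ("cyd", 41)]            # (name, age)
purchases = [("ann", "DE", 30), ("bob", "US", 20),
             ("ann", "DE", 50), ("cyd", "US", 10)]         # (buyer, country, amount)


def join_second_order(users, purchases):
    """Join written with explicit second-order functions (map, filter)."""
    # Build all (user, purchase) pairs, then flatten the nested result.
    nested = map(lambda u: map(lambda p: (u, p), purchases), users)
    pairs = (pair for group in nested for pair in group)
    # Keep only pairs whose user name matches the buyer.
    matched = filter(lambda up: up[0][0] == up[1][0], pairs)
    return list(map(lambda up: (up[0][0], up[0][1], up[1][2]), matched))


def join_comprehension(users, purchases):
    """The same join as a comprehension, which reads much like SQL."""
    return [(name, age, amount)
            for (name, age) in users
            for (buyer, _country, amount) in purchases
            if name == buyer]


def fold(zero, init, plus, xs):
    """Structural recursion: map each element with init, combine with plus."""
    return reduce(plus, map(init, xs), zero)


def parallel_style_fold(zero, init, plus, xs, partitions=2):
    """Fold each partition separately, then combine the partial results.

    This only agrees with fold() when plus is associative and commutative
    and zero is its neutral element. The partitions are processed
    sequentially here; a dataflow engine would process them in parallel.
    """
    chunks = [xs[i::partitions] for i in range(partitions)]
    partials = [fold(zero, init, plus, chunk) for chunk in chunks]
    return reduce(plus, partials, zero)
```

Both join functions return the same list, but only the comprehension states the join condition declaratively, which is the kind of structure an optimizer can recognize and rewrite. Summing the joined amounts with `fold` and with `parallel_style_fold` gives the same total because addition satisfies the algebraic conditions named in the docstring.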

[41]  Volker Markl, et al.  Implicit Parallelism through Deep Language Embedding, 2015, SIGMOD Conference.
