Implicit Parallelism through Deep Language Embedding

The appeal of MapReduce has spawned a family of systems that implement or extend it. To enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control flow structure using such APIs reveals a number of limitations that impede programmer productivity. In this paper we show that designing data analysis languages and APIs from the point of view of the runtime engine bloats the APIs with low-level primitives and reduces programmer productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataflows and (ii) hides the notion of data-parallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.

Alexander B. Alexandrov | O. Kao | V. Markl | Asterios Katsifodimos | L. Thamsen | Andreas Kunft | F. Schüler | T. Herb

[1] Alvin Cheung, et al. Sloth: being lazy is a virtue (when issuing database queries), 2014, SIGMOD Conference.

[2] Torsten Grust, et al. The Flatter, the Better: Query Compilation Based on the Flattening Transformation, 2015, SIGMOD Conference.

[3] Jiaxing Zhang, et al. Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE, 2012, OSDI.

[4] Felix Naumann, et al. The Stratosphere platform for big data analytics, 2014, The VLDB Journal.

[5] Eugene Burmako, et al. Scala macros: let our powers combine!: on how rich syntax and static types work with metaprogramming, 2013, SCALA@ECOOP.

[6] Volker Markl, et al. Spinning Fast Iterative Data Flows, 2012, Proc. VLDB Endow.

[7] Martin Odersky, et al. Lightweight modular staging, 2012, Commun. ACM.

[8] Tiark Rompf, et al. Jet: An Embedded DSL for High Performance Big Data Processing, 2012.

[9] Andrey Balmin, et al. Jaql, 2011, Proc. VLDB Endow.

[10] Michael Isard, et al. Steno: automatic optimization of declarative queries, 2011, PLDI '11.

[11] Kunle Olukotun, et al. A domain-specific approach to heterogeneous parallelism, 2011, PPoPP '11.

[12] Michael D. Ernst, et al. HaLoop, 2010, Proc. VLDB Endow.

[13] Scott Shenker, et al. Spark: Cluster Computing with Working Sets, 2010, HotCloud.

[14] Krishna P. Gummadi, et al. Measuring User Influence in Twitter: The Million Follower Fallacy, 2010, ICWSM.

[15] Michael Isard, et al. Distributed aggregation for data-parallel computing: interfaces and implementations, 2009, SOSP '09.

[16] Guy L. Steele, et al. Organizing functional code for parallel execution or, foldl and foldr considered slightly harmful, 2009, ICFP.

[17] Pete Wyckoff, et al. Hive - A Warehousing Solution Over a Map-Reduce Framework, 2009, Proc. VLDB Endow.

[18] Torsten Grust, et al. FERRY: database-supported program execution, 2009, SIGMOD Conference.

[19] Michael Isard, et al. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, 2008, OSDI.

[20] Jingren Zhou, et al. SCOPE: easy and efficient parallel processing of massive data sets, 2008, Proc. VLDB Endow.

[21] Ravi Kumar, et al. Pig latin: a not-so-foreign language for data processing, 2008, SIGMOD Conference.

[22] Yuan Yu, et al. Dryad: distributed data-parallel programs from sequential building blocks, 2007, EuroSys '07.

[23] Scott Boag, et al. XQuery 1.0: An XML Query Language, 2007.

[24] Brian Beckman, et al. LINQ: reconciling object, relations and XML in the .NET framework, 2006, SIGMOD Conference.

[25] Torsten Grust, et al. How to Comprehend Queries Functionally, 1999, Journal of Intelligent Information Systems.

[26] Erik Poll, et al. Algebra of Programming by Richard Bird and Oege de Moor, Prentice Hall, 1996 (dated 1997), 1999.

[27] Richard S. Bird, et al. Algebra of programming, 1997, Prentice Hall International series in computer science.

[28] Limsoon Wong, et al. Principles of Programming with Complex Objects and Collection Types, 1995, Theor. Comput. Sci.

[29] Dan Suciu, et al. On Two Forms of Structural Recursion, 1995, ICDT.

[30] Kyuseok Shim, et al. Including Group-By in Query Optimization, 1994, VLDB.

[31] Simon L. Peyton Jones, et al. Cheap Deforestation in Practice: An Optimizer for Haskell, 1994, IFIP Congress.

[32] Dan Suciu, et al. Comprehension syntax, 1994, SIGMOD Rec.

[33] Joachim Lambek, et al. Least fixpoints of endofunctors of cartesian closed categories, 1993, Mathematical Structures in Computer Science.

[34] Goetz Graefe, et al. Query evaluation techniques for large databases, 1993, CSUR.

[35] Goetz Graefe, et al. The Volcano optimizer generator: extensibility and efficient search, 1993, Proceedings of IEEE 9th International Conference on Data Engineering.

[36] Jack A. Orenstein, et al. The ObjectStore database system, 1991, Commun. ACM.

[37] Maarten M. Fokkinga, et al. Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire, 1991, FPCA.

[38] Hartmut Ehrig, et al. Fundamentals of Algebraic Specification 1: Equations and Initial Semantics, 1985.

[39] Won Kim, et al. On optimizing an SQL-like nested query, 1982, TODS.

[40] Matthias Jarke, et al. Query processing strategies in the PASCAL/R relational database management system, 1982, SIGMOD '82.

[41] Patricia G. Selinger, et al. Access path selection in a relational database management system, 1979, SIGMOD '79.

[42] J. Lambek. A fixpoint theorem for complete categories, 1968.