How to Architect a Query Compiler, Revisited

To leverage modern hardware platforms to their fullest, more and more database systems embrace compilation of query plans to native code. In the research community, there is an ongoing debate about the best way to architect such query compilers. This is perceived to be a difficult task, requiring techniques fundamentally different from traditional interpreted query execution. We aim to contribute to this discussion by drawing attention to an old but underappreciated idea known as Futamura projections, which fundamentally link interpreters and compilers. Guided by this idea, we demonstrate that efficient query compilation can actually be very simple, using techniques that are no more difficult than writing a query interpreter in a high-level language. Moreover, we demonstrate how intricate compilation patterns that were previously used to justify multiple compiler passes can be realized in a single, straightforward generation pass. Key examples are the injection of specialized index structures, data representation changes such as string dictionaries, and various kinds of code motion that reduce the amount of work on the critical path. We present LB2: a high-level query compiler developed in this style that performs on par with, and sometimes beats, the best compiled query engines on the standard TPC-H benchmark.
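
To make the link between interpreters and compilers concrete, here is a minimal sketch of the idea in Python. It is not LB2's implementation: the operator classes (Scan, Filter, Project), the compile_query helper, and the use of source strings plus exec as a stand-in for a staging framework are all assumptions chosen for illustration. Each operator is written like a push-style interpreter, but instead of processing records it emits the code that would process them, so one traversal of the plan yields a specialized query function.

```python
# Illustrative sketch only: all names here are hypothetical, not LB2's API.
# Each operator's gen() has the shape of a push-style interpreter, but it
# emits Python source text instead of touching any data.

class Scan:
    def __init__(self, table):
        self.table = table  # key into the `tables` dict passed at run time

    def gen(self, emit, consume, indent):
        emit(f"{indent}for rec in tables[{self.table!r}]:")
        consume("rec", indent + "    ")


class Filter:
    def __init__(self, child, pred):
        # pred maps the name of a record variable to a boolean expression string
        self.child, self.pred = child, pred

    def gen(self, emit, consume, indent):
        def on_record(rec, ind):
            emit(f"{ind}if {self.pred(rec)}:")
            consume(rec, ind + "    ")
        self.child.gen(emit, on_record, indent)


class Project:
    def __init__(self, child, fields):
        self.child, self.fields = child, fields

    def gen(self, emit, consume, indent):
        def on_record(rec, ind):
            fields = ", ".join(f"{f!r}: {rec}[{f!r}]" for f in self.fields)
            emit(f"{ind}row = {{{fields}}}")
            consume("row", ind)
        self.child.gen(emit, on_record, indent)


def compile_query(plan):
    """Walk the plan once and return (compiled function, generated source)."""
    lines = ["def query(tables):", "    out = []"]
    plan.gen(lines.append,
             lambda rec, ind: lines.append(f"{ind}out.append({rec})"),
             "    ")
    lines.append("    return out")
    source = "\n".join(lines)
    namespace = {}
    exec(source, namespace)  # turn the generated text into a callable
    return namespace["query"], source


# Usage: SELECT name FROM items WHERE qty > 10
plan = Project(Filter(Scan("items"), lambda r: f"{r}['qty'] > 10"), ["name"])
query, source = compile_query(plan)
print(source)
print(query({"items": [{"name": "a", "qty": 5}, {"name": "b", "qty": 20}]}))
```

The generated function contains only a loop, a condition, and a row construction; the operator objects and their callbacks exist at generation time only. A typed staging framework would replace the string manipulation used here, but the structure of the operators would stay the same.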
