UDF to SQL translation through compositional lazy inductive synthesis

Many data processing systems allow SQL queries that call user-defined functions (UDFs) written in conventional programming languages. While such SQL extensions provide convenience and flexibility to users, queries involving UDFs are not as efficient as their pure SQL counterparts that invoke SQL’s highly optimized built-in functions. Motivated by this problem, we propose a new technique for translating SQL queries with UDFs to pure SQL expressions. Unlike prior work in this space, our method is not based on syntactic rewrite rules and can handle a much more general class of UDFs. At a high level, our method is based on counterexample-guided inductive synthesis (CEGIS) but employs a novel compositional strategy that decomposes the synthesis task into simpler sub-problems. However, because there is no universal decomposition strategy that works for all UDFs, we propose a novel lazy inductive synthesis approach that generates a sequence of decompositions corresponding to increasingly harder inductive synthesis problems. Because most realistic UDF-to-SQL translation tasks are amenable to a fine-grained decomposition strategy, our lazy inductive synthesis method scales significantly better than traditional CEGIS. We have implemented our proposed technique in a tool called CLIS for optimizing Spark SQL programs containing Scala UDFs. To evaluate CLIS, we manually study 100 randomly selected UDFs and find that 63 of them can be expressed in pure SQL. Our evaluation on these 63 UDFs shows that CLIS can automatically synthesize equivalent SQL expressions in 92% of the cases and that it can solve 2.4× more benchmarks than a baseline that does not use our compositional approach. We also show that CLIS yields an average speed-up of 3.5× for individual UDFs and 1.3× to 3.1× in terms of end-to-end application performance.
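For readers unfamiliar with CEGIS, the basic loop that the abstract builds on can be illustrated with a toy sketch. It is not the CLIS algorithm: it shows only plain CEGIS, without the compositional decomposition or the lazy strategy. The UDF, the five-entry candidate list, and the bounded check that stands in for a real equivalence verifier are all hypothetical and chosen purely for illustration.

```python
# Hypothetical UDF to translate: clamps negative values to zero.
def udf(x):
    if x < 0:
        return 0
    return x

# Tiny, hand-written candidate space (an assumption of this sketch):
# pairs of SQL text and a Python function modelling its meaning.
CANDIDATES = [
    ("x", lambda x: x),
    ("0", lambda x: 0),
    ("ABS(x)", lambda x: abs(x)),
    ("GREATEST(x, 0)", lambda x: max(x, 0)),
    ("CASE WHEN x < 0 THEN 0 ELSE x END", lambda x: 0 if x < 0 else x),
]

def synthesize(examples):
    """Inductive step: return the first candidate consistent with all examples."""
    for sql, sem in CANDIDATES:
        if all(sem(i) == o for i, o in examples):
            return sql, sem
    return None

def verify(sem, domain):
    """Bounded check standing in for a real verifier.

    Returns an input on which the candidate and the UDF disagree, or None.
    """
    for x in domain:
        if sem(x) != udf(x):
            return x
    return None

def cegis(domain, seed):
    """Alternate synthesis and verification until a candidate survives."""
    examples = [(seed, udf(seed))]
    while True:
        candidate = synthesize(examples)
        if candidate is None:
            return None, examples  # candidate space exhausted
        sql, sem = candidate
        counterexample = verify(sem, domain)
        if counterexample is None:
            return sql, examples
        # Learn from the failure: add the counterexample as a new example.
        examples.append((counterexample, udf(counterexample)))

result_sql, used_examples = cegis(range(-5, 6), seed=3)
print(result_sql)
```

Starting from the single example `3 -> 3`, the loop first proposes `x`, the bounded check refutes it with the input `-5`, and the next consistent candidate passes the check. A realistic synthesizer would search a grammar of SQL expressions and call a proper equivalence checker rather than scan a fixed list over a small integer range.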

[1]  Isil Dillig,et al.  Relational verification using reinforcement learning , 2019, Proc. ACM Program. Lang..

[2]  Volker Markl,et al.  Peeking into the optimization of data flow programs with MapReduce-style UDFs , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[3]  Daniel Kroening,et al.  A Tool for Checking ANSI-C Programs , 2004, TACAS.

[4]  Maaz Bin Safeer Ahmad,et al.  Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications , 2018, SIGMOD Conference.

[5]  Ulf Leser,et al.  Versatile optimization of UDF-heavy data flows with sofa , 2014, SIGMOD Conference.

[6]  Alvin Cheung,et al.  Optimizing database-backed applications with query synthesis , 2013, PLDI.

[7]  Consolidation of queries with user-defined functions , 2014, PLDI.

[8]  S. Sudarshan,et al.  Decorrelation of user defined function invocations in queries , 2014, 2014 IEEE 30th International Conference on Data Engineering.

[9]  Maaz Bin Safeer Ahmad,et al.  Automatically translating image processing libraries to halide , 2019, ACM Trans. Graph..

[10]  Karthik Ramachandra,et al.  Aggify: Lifting the Curse of Cursor Loops using Custom Aggregates , 2020, SIGMOD Conference.

[11]  Sumit Gulwani,et al.  Compositional Program Synthesis from Natural Language and Examples , 2015, IJCAI.

[12]  S. Sudarshan,et al.  DBridge: Translating Imperative Code to SQL , 2017, SIGMOD Conference.

[13]  Jiaxing Zhang,et al.  Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE , 2012, OSDI.

[14]  William R. Cook,et al.  Interprocedural query extraction for transparent persistence , 2008, OOPSLA.

[15]  Sanjit A. Seshia,et al.  Combinatorial sketching for finite programs , 2006, ASPLOS XII.

[16]  Armin Biere,et al.  Bounded model checking , 2003, Adv. Comput..

[17]  Kunle Olukotun,et al.  Flare: Optimizing Apache Spark with Native Compilation for Scale-Up Architectures and Medium-Size Data , 2018, OSDI.

[18]  William R. Cook,et al.  Extracting queries by static analysis of transparent persistence , 2007, POPL '07.

[19]  Isil Dillig,et al.  Synthesizing database programs for schema refactoring , 2019, PLDI.

[20]  Armando Solar-Lezama,et al.  Program synthesis from polymorphic refinement types , 2015, PLDI.

[21]  Kwanghyun Park,et al.  BlackMagic: Automatic Inlining of Scalar UDFs into SQL Queries with Froid , 2019, Proc. VLDB Endow..

[22]  Alexander Aiken,et al.  Stochastic superoptimization , 2012, ASPLOS '13.

[23]  Carsten Binnig,et al.  An Architecture for Compiling UDF-centric Workflows , 2015, Proc. VLDB Endow..

[24]  S. Sudarshan,et al.  Extracting Equivalent SQL from Imperative Code in Database Applications , 2016, SIGMOD Conference.

[25]  Mark N. Wegman,et al.  Efficiently computing static single assignment form and the control dependence graph , 1991, TOPL.

[26]  Shuvendu K. Lahiri,et al.  SYMDIFF: A Language-Agnostic Semantic Diff Tool for Imperative Programs , 2012, CAV.

[27]  Martin Odersky,et al.  Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs , 2010, GPCE '10.

[28]  Isil Dillig,et al.  Synthesizing JIT Compilers for In-Kernel DSLs , 2020, CAV.

[29]  Akash Lal,et al.  Optimizing Big-Data Queries Using Program Synthesis , 2017, SOSP.

[30]  Isil Dillig,et al.  Trinity: An Extensible Synthesis Framework for Data Science , 2019, Proc. VLDB Endow..

[31]  Kwanghyun Park,et al.  Froid: Optimization of Imperative Programs in a Relational Database , 2017, Proc. VLDB Endow..

[32]  Alvin Cheung,et al.  Packet Transactions: High-Level Programming for Line-Rate Switches , 2015, SIGCOMM.

[33]  Sumit Gulwani,et al.  FlashMeta: a framework for inductive program synthesis , 2015, OOPSLA.

[34]  Yanjun Wang,et al.  Reconciling enumerative and deductive program synthesis , 2020, PLDI.

[35]  Rajeev Alur,et al.  Syntax-guided synthesis , 2013, 2013 Formal Methods in Computer-Aided Design.

[36]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.