Learning to optimize federated queries

Query optimization is challenging for any database system, even when its inner workings are well understood. Consider, then, query planning for a federation of third-party data sources about which little is known. This is exactly the challenge of orchestrating data execution and movement faced by Tableau's cross-database joins feature, where the data for a query originates from two or more data sources. In this paper, we present our work on using machine learning techniques to address one of the most fundamental challenges in federated query optimization: the dynamic designation of a federation engine. Our machine learning model learns the performance and data characteristics of a system by extracting features from query plans. We further extend our model with the ability to manipulate database settings at a per-query level. Our experimental results demonstrate a speedup of up to 10.7x compared to an existing federated query optimizer.
