Efficient Deep Learning Pipelines for Accurate Cost Estimations Over Large Scale Query Workload

The use of deep learning models to forecast the resource consumption patterns of SQL queries has recently become a popular area of study. While these models have demonstrated promising accuracy, training them over large-scale industry workloads is expensive. Space-inefficient encoding techniques over large numbers of queries, together with the excessive padding used to enforce shape consistency across diverse query plans, imply 1) longer model training time and 2) the need for expensive, scaled-up infrastructure to support batched training. In response, we developed Prestroid, a tree-convolution-based data science pipeline that accurately predicts the resource consumption patterns of query traces at a much lower cost. We evaluated our pipeline over 19K Presto OLAP queries on a data lake of more than 20PB of data from Grab. Experimental results indicate that our pipeline outperforms benchmarks on predictive accuracy, contributing to more precise resource prediction for large-scale workloads, while also reducing per-batch memory footprint by 13.5x and per-epoch training time by 3.45x. We demonstrate direct cost savings of up to 13.2x for large batched model training over Microsoft Azure VMs.
