Peregrine: Workload Optimization for Cloud Query Engines

Database administrators (DBAs) were traditionally responsible for optimizing on-premises database workloads. However, with the rise of cloud data services, where cloud providers offer fully managed data processing capabilities, the DBA role is missing entirely. At the same time, optimizing query workloads is becoming increasingly important for reducing the total cost of operation and making data processing economically viable in the cloud. This paper revisits workload optimization in the context of these emerging cloud-based data services. We observe that the missing DBA in these newer data services has affected both end users and system developers: for users, workload optimization is a major pain point, while system developers are now tasked with supporting a large base of cloud users. We present Peregrine, a workload optimization platform for cloud query engines that we have been developing for the big data analytics infrastructure at Microsoft. Peregrine makes three major contributions: (i) a novel way of representing query workloads that is agnostic to the query engine and general enough to describe a large variety of workloads, (ii) a categorization of the typical workload patterns, derived from production workloads at Microsoft, together with the workload optimizations possible in each category, and (iii) a prescription for adding workload awareness to a query engine through query annotations that are served to the query engine at compile time. We discuss a case study of Peregrine using two optimizations over two query engines, namely Scope and Spark. Peregrine has helped cut the time to develop new workload optimization features from years to months, benefiting the research teams, the product teams, and the customers at Microsoft.
