Production Experiences from Computation Reuse at Microsoft

Massive data processing infrastructures are commonplace in modern data-driven enterprises. They help data engineers build scalable data pipelines over shared datasets. Unfortunately, data engineers often end up building pipelines that share portions of their computations with other pipelines over the same shared datasets. Consolidating these data pipelines is therefore crucial for eliminating redundancy and improving production efficiency, which in turn saves significant operational costs. We built CloudViews for automatic computation reuse in Cosmos big data workloads at Microsoft. CloudViews adds a feedback loop to the SCOPE query engine that learns from past workloads and opportunistically materializes and reuses common computations as part of query processing in future SCOPE jobs, all completely automatic and transparent to users. In this paper, we describe our production experiences with CloudViews. We first describe the data preparation process in Cosmos and show how computation reuse naturally augments it: computation reuse prepares data further into more shareable datasets that improve the performance and efficiency of subsequent processing. We then discuss the usage and impact of CloudViews on our production clusters and describe many of the operational challenges that we have faced so far. Results from our current production deployment over a two-month window show that the cumulative latency of jobs improved by 34%, with a median improvement of 15%, and that the total processing time was reduced by 37%, indicating better customer experience and lower operational costs for these workloads.
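To make the idea of learning common computations from past workloads concrete, the following is a minimal sketch and not the CloudViews implementation, whose details are not given in this abstract. It assumes that each past job is available as a nested tuple `(operator, *children)`, hashes every subexpression into a signature, and reports subexpressions that occur in at least two jobs as candidates for materialization and reuse. The plan encoding and the names `signature`, `subexpressions`, and `reuse_candidates` are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict


def signature(plan):
    """Return a stable hash for a plan subtree given as (operator, *children)."""
    op, *children = plan
    child_sigs = [signature(child) for child in children]
    text = op + "(" + ",".join(child_sigs) + ")"
    return hashlib.sha256(text.encode()).hexdigest()[:16]


def subexpressions(plan):
    """Yield the plan itself and every subtree below it."""
    yield plan
    for child in plan[1:]:
        yield from subexpressions(child)


def reuse_candidates(jobs, min_jobs=2):
    """Map signature -> (subexpression, job ids) for subtrees shared across jobs."""
    jobs_by_sig = defaultdict(set)
    plan_by_sig = {}
    for job_id, plan in jobs.items():
        for sub in subexpressions(plan):
            if len(sub) == 1:
                # Assumption: a bare scan of a shared dataset is not worth materializing.
                continue
            sig = signature(sub)
            jobs_by_sig[sig].add(job_id)
            plan_by_sig[sig] = sub
    return {
        sig: (plan_by_sig[sig], sorted(ids))
        for sig, ids in jobs_by_sig.items()
        if len(ids) >= min_jobs
    }


# Two hypothetical past jobs that filter the same shared dataset in the same way.
shared = ("filter:country=US", ("scan:clicks",))
jobs = {
    "job_a": ("agg:count", shared),
    "job_b": ("join:user_id", shared, ("scan:users",)),
}
candidates = reuse_candidates(jobs)
for plan, job_ids in candidates.values():
    print(plan[0], "is shared by", job_ids)
```

In this toy workload only the filter over `scan:clicks` is shared, so it is the single candidate. A production system would also need to weigh the storage cost and expected benefit of each candidate before materializing it, which this sketch leaves out.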
