Computation Reuse in Analytics Job Service at Microsoft

Analytics-as-a-service, or analytics job service, is emerging as a new paradigm for data analytics, be it in a cloud environment or within enterprises. In this setting, users are not required to manage or tune their hardware and software infrastructure, and they pay only for the processing resources consumed per job. However, the shared nature of these job services across several users and teams leads to significant overlaps in partial computations, i.e., parts of the processing are duplicated across multiple jobs, thus generating redundant costs. In this paper, we describe a computation reuse framework, coined CLOUDVIEWS, which we built to address the computation overlap problem in Microsoft's SCOPE job service. We present a detailed analysis from our production workloads to motivate the computation overlap problem and the possible gains from computation reuse. The key aspects of our system are the following: (i) we reuse computations by creating materialized views over recurring workloads, i.e., periodically executing jobs that have the same script templates but process new data each time, (ii) we select the views to materialize using a feedback loop that reconciles the compile-time and run-time statistics and gathers precise measures of the utility and cost of each overlapping computation, and (iii) we create materialized views in an online fashion, without requiring an offline phase to materialize the overlapping computations.
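As a rough illustration of aspect (ii), the sketch below groups observed subexpressions by a plan signature, keeps those shared by at least two jobs, and picks views greedily by estimated utility per byte of storage. It is a minimal sketch under stated assumptions, not the selection method used in CLOUDVIEWS: the `Subexpression` record, the `signature` field, the utility formula, and the greedy budgeted selection are all hypothetical choices made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Subexpression:
    """One observed subexpression from a finished job (hypothetical record)."""
    job_id: str
    signature: str          # assumed: hash of the normalized subplan template
    runtime_seconds: float  # run-time statistic: observed cost of computing it
    output_bytes: int       # run-time statistic: observed size of its output


def select_views(observations, storage_budget_bytes):
    """Greedily pick overlapping subexpressions to materialize.

    Assumption: utility is the average observed runtime multiplied by the
    number of jobs that could reuse the result instead of recomputing it.
    """
    groups = defaultdict(list)
    for obs in observations:
        groups[obs.signature].append(obs)

    candidates = []
    for signature, group in groups.items():
        jobs = {o.job_id for o in group}
        if len(jobs) < 2:
            continue  # computed by a single job only, so there is no overlap
        avg_runtime = sum(o.runtime_seconds for o in group) / len(group)
        size = max(o.output_bytes for o in group)
        utility = avg_runtime * (len(jobs) - 1)  # one job still pays the cost
        candidates.append((signature, utility, size))

    # Highest utility per stored byte first.
    candidates.sort(key=lambda c: c[1] / max(c[2], 1), reverse=True)

    selected, used = [], 0
    for signature, _utility, size in candidates:
        if used + size <= storage_budget_bytes:
            selected.append(signature)
            used += size
    return selected


observations = [
    Subexpression("jobA", "s1", 100.0, 10),
    Subexpression("jobB", "s1", 100.0, 10),
    Subexpression("jobC", "s1", 100.0, 10),
    Subexpression("jobA", "s2", 50.0, 100),
    Subexpression("jobB", "s2", 50.0, 100),
    Subexpression("jobC", "s3", 500.0, 5),  # expensive, but not shared
]
print(select_views(observations, storage_budget_bytes=50))
```

With a 50-byte budget only `s1` fits. `s3` is never a candidate, however expensive it is, because only one job computes it. In a feedback loop of the kind the abstract describes, the runtime and size fields would come from statistics of previous runs of the recurring jobs rather than from compile-time estimates alone.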
