SparkCruise: Workload Optimization in Managed Spark Clusters at Microsoft

Cloud companies today offer fully managed Spark services. This has made it easy to onboard new customers, but it has also increased the number of users and the size of their workloads. However, both cloud providers and users lack the tools and time to optimize these massive workloads. To solve this problem, we designed SparkCruise, which helps understand and optimize workload instances by adding a workload-driven feedback loop to the Spark query optimizer. In this paper, we present our approach to collecting and representing Spark query workloads and use it to improve overall workload performance, all without requiring any access to user data. These methods scale with the number of workloads and apply learned feedback in an online fashion. We explain one specific workload optimization developed for computation reuse. We also share a detailed analysis of production Spark workloads and contrast it with the corresponding analysis of the TPC-DS benchmark. To the best of our knowledge, this is the first study to share an analysis of large-scale production Spark SQL workloads.

PVLDB Reference Format: Abhishek Roy, Alekh Jindal, Priyanka Gomatam, Xiating Ouyang, Ashit Gosalia, Nishkam Ravi, Swinky Mann, and Prakhar Jain. SparkCruise: Workload Optimization in Managed Spark Clusters at Microsoft. PVLDB, 14(12): 3122-3134, 2021. doi:10.14778/3476311.3476388
