Phoebe: A Learning-based Checkpoint Optimizer

Easy-to-use programming interfaces paired with cloud-scale processing engines have enabled big data system users to author arbitrarily complex analytical jobs over massive volumes of data. However, as the complexity and scale of analytical jobs increase, they encounter a number of unforeseen problems: hotspots with large intermediate data on temporary storage, longer job recovery time after failures, and worse query optimizer estimates are examples of the issues that we are facing at Microsoft. To address these issues, we propose Phoebe, an efficient learning-based checkpoint optimizer. Given a set of constraints and an objective function at compile-time, Phoebe is able to determine the decomposition of job plans and the optimal set of checkpoints to preserve their outputs to durable global storage. Phoebe consists of three machine learning predictors and one optimization module. For each stage of a job, Phoebe makes accurate predictions for: (1) the execution time, (2) the output size, and (3) the start/end time, taking into account the inter-stage dependencies. Using these predictions, we formulate checkpoint optimization as an integer programming problem and propose a scalable heuristic algorithm that meets the latency requirement of the production environment. We demonstrate the effectiveness of Phoebe in production workloads, and show that we can free the temporary storage on hotspots by more than 70% and restart failed jobs 68% faster on average with minimal performance impact. Phoebe also illustrates that adding multiple sets of checkpoints is not cost-efficient, which dramatically reduces the complexity of the optimization.

PVLDB Reference Format: Yiwen Zhu, Matteo Interlandi, Abhishek Roy, Krishnadhan Das, Hiren Patel, Malay Bag, Hitesh Sharma, and Alekh Jindal. Phoebe: A Learning-based Checkpoint Optimizer. PVLDB, 14(11): 2505-2518, 2021. doi:10.14778/3476249.3476298
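To make the checkpoint selection idea in the abstract concrete, the following is a minimal sketch and not Phoebe's actual formulation. It assumes a toy job whose stages form a linear chain, that each stage's temporary output would otherwise be held until the job ends, and that a single checkpoint after stage k lets the temporary outputs of all stages up to k be freed at k's predicted end time. The `Stage` fields, the `best_single_cut` function, the units, and the global-storage budget `max_checkpoint_gb` are all hypothetical names introduced for illustration. The sketch enumerates every candidate cut and keeps the one that maximizes the freed temporary storage (in GB-seconds) while respecting the budget.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Stage:
    name: str
    output_gb: float    # predicted output size (assumed unit: GB)
    end_time_s: float   # predicted end time, seconds from job start


def best_single_cut(
    stages: List[Stage], max_checkpoint_gb: float
) -> Tuple[Optional[str], float]:
    """Pick one checkpoint stage in a linear chain (toy model, an assumption).

    Saving of a cut at stage k = (total temp output of stages up to k)
    * (job end time - end time of k), i.e. GB-seconds of temp storage freed.
    Only the output of stage k itself is written to global storage, so it
    must fit within max_checkpoint_gb.
    """
    job_end = max(s.end_time_s for s in stages)
    ordered = sorted(stages, key=lambda s: s.end_time_s)

    best_name: Optional[str] = None
    best_saving = 0.0
    temp_so_far = 0.0
    for stage in ordered:
        temp_so_far += stage.output_gb
        if stage.output_gb > max_checkpoint_gb:
            continue  # checkpoint would exceed the global-storage budget
        saving = temp_so_far * (job_end - stage.end_time_s)
        if saving > best_saving:
            best_name, best_saving = stage.name, saving
    return best_name, best_saving


if __name__ == "__main__":
    job = [
        Stage("A", 100.0, 10.0),
        Stage("B", 50.0, 20.0),
        Stage("C", 10.0, 30.0),
        Stage("D", 5.0, 40.0),
    ]
    print(best_single_cut(job, max_checkpoint_gb=60.0))
```

Exhaustive enumeration is only viable here because a single cut on a chain has linearly many candidates; on a real job graph the candidate cuts grow combinatorially, which is why a scalable heuristic is needed in production.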
