KEA: Tuning an Exabyte-Scale Data Infrastructure

Microsoft's internal big-data infrastructure is one of the largest in the world---with over 300k machines running billions of tasks from over 0.6M daily jobs. Operating this infrastructure is a costly and complex endeavor, and efficiency is paramount. In fact, for over 15 years, a dedicated engineering team has tuned almost every aspect of this infrastructure, achieving state-of-the-art efficiency (>60% average CPU utilization across all clusters). Despite rich telemetry and strong expertise, this manual tuning approach had reached its limit in the face of evolving hardware, software, and workloads---we had plateaued. In this paper, we present KEA, a multi-year effort to automate our tuning processes to be fully data/model-driven. KEA leverages a mix of domain knowledge and principled data science to capture the essence of our clusters' dynamic behavior in a set of machine learning (ML) models built on collected system data. These models power automated optimization procedures for parameter tuning, and inform our leadership on critical decisions around engineering and capacity management (such as hardware and data center design, software investments, etc.). We combine "observational" tuning (i.e., using models to predict system behavior without direct experimentation) with judicious use of "flighting" (i.e., conservative testing in production). This allows us to support the broad range of applications that we discuss in this paper. KEA continuously tunes our cluster configurations and is on track to save Microsoft tens of millions of dollars per year. To the best of our knowledge, this paper is the first to discuss the research challenges and practical learnings that emerge when tuning an exabyte-scale data infrastructure.
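To make the "observational" tuning idea concrete, the sketch below shows one plausible shape of such a loop: fit a surrogate model of cluster behavior on historical telemetry, score candidate configurations purely offline, and promote only the most promising one to a conservative production "flight". This is an illustrative assumption, not KEA's actual implementation; the parameter names (container memory, parallelism), the synthetic telemetry, and the choice of a random-forest surrogate are all hypothetical.

```python
# Minimal sketch (hypothetical, not KEA's code): observational tuning via a
# surrogate model learned from telemetry, followed by a single conservative flight.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic telemetry: each row is (container_memory_gb, max_parallelism)
# observed on past workloads, with the resulting average CPU utilization.
history_configs = rng.uniform([4, 8], [64, 256], size=(500, 2))
history_util = (
    0.9
    - 0.004 * np.abs(history_configs[:, 0] - 32)   # assumed memory sweet spot ~32 GB
    - 0.001 * np.abs(history_configs[:, 1] - 128)  # assumed parallelism sweet spot ~128
    + rng.normal(0, 0.02, 500)                     # measurement noise
)

# Observational step: learn a surrogate of cluster behavior from data alone,
# with no direct experimentation on the production system.
surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(history_configs, history_util)

# Score a grid of candidate configurations entirely offline.
mem_grid, par_grid = np.meshgrid(np.linspace(4, 64, 25), np.linspace(8, 256, 25))
candidates = np.column_stack([mem_grid.ravel(), par_grid.ravel()])
predicted_util = surrogate.predict(candidates)

# Flighting step (conservative): promote only the top-ranked candidate, to be
# validated on a small slice of production before any broader rollout.
best = candidates[np.argmax(predicted_util)]
print(f"Candidate to flight: memory={best[0]:.1f} GB, parallelism={int(best[1])}, "
      f"predicted CPU utilization={predicted_util.max():.2%}")
```

In practice one would replace the synthetic history with real telemetry and gate the flighted configuration behind guardrail metrics before widening the rollout; the sketch only shows the division of labor between offline prediction and in-production testing.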
