Silhouette: Efficient Cloud Configuration Exploration for Large-Scale Analytics

Choosing the best cloud configuration for a large-scale data analytics job can substantially improve its performance and reduce its cost. However, cloud providers offer a wide variety of instance types and customizable cluster sizes, which makes pinpointing the optimal configuration both time-consuming and costly. This article presents the design, implementation, and evaluation of Silhouette, a cloud configuration selection framework that builds performance models for a variety of large-scale analytics jobs with minimal training overhead. The essence of Silhouette is to build performance prediction models from carefully selected small-scale experiments on small subsets of the input data, and to use those models to estimate performance on the entire input data at larger cluster sizes. To reduce training time and cost, Silhouette incorporates new statistical techniques that select the experiments yielding the most information for performance prediction. Moreover, we develop a novel model transformer that converts a prediction model built for one instance type into a model for a different instance type with only one extra experiment, which significantly reduces the training overhead. We evaluate Silhouette with an extensive array of large-scale data analytics jobs on Amazon EC2. Our experimental results show that Silhouette is effective in optimizing cloud configurations while saving both training time and cost compared with existing solutions.
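The core idea of predicting full-scale runtime from small-scale experiments can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-term feature set (fixed cost, per-machine work, logarithmic coordination cost), the synthetic measurements, and the names `features`, `fit`, and `predict` are hypothetical and are not Silhouette's actual model or API. The experiment-selection step and the model transformer described above are not shown.

```python
import math

def features(scale, machines):
    # Assumed feature set (illustrative only, not Silhouette's model):
    # fixed cost, work per machine, and a log-shaped coordination cost.
    return [1.0, scale / machines, math.log(machines)]

def solve(a, b):
    # Solve the linear system a x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit(experiments):
    # Ordinary least squares via the normal equations (X^T X) theta = X^T y.
    rows = [features(s, m) for s, m, _ in experiments]
    ys = [t for _, _, t in experiments]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(theta, scale, machines):
    # Predicted runtime for a given data fraction and cluster size.
    return sum(t * f for t, f in zip(theta, features(scale, machines)))

# Synthetic stand-in for measured runtimes (hypothetical coefficients, no noise).
TRUE_THETA = [5.0, 200.0, 3.0]

def true_runtime(scale, machines):
    return predict(TRUE_THETA, scale, machines)

# Small-scale training runs: (fraction of input data, machines, runtime in seconds).
small_runs = [(0.02, 1), (0.04, 2), (0.06, 4), (0.08, 2), (0.10, 4), (0.05, 1)]
experiments = [(s, m, true_runtime(s, m)) for s, m in small_runs]

theta = fit(experiments)
# Extrapolate to the entire input (scale = 1.0) on a 16-machine cluster.
full_scale_estimate = predict(theta, 1.0, 16)
print(f"estimated full-scale runtime: {full_scale_estimate:.2f} s")
```

In this sketch the training runs use at most 10% of the data and at most 4 machines, yet the fitted model is queried at full scale on 16 machines. With real, noisy measurements the quality of the extrapolation would depend heavily on which small-scale experiments are run, which is the problem the experiment-selection techniques in the abstract address.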
