Selecta: Heterogeneous Cloud Storage Configuration for Data Analytics

Data analytics applications are an important class of data-intensive workloads on public cloud services. However, selecting the right compute and storage configuration for these applications is difficult because the space of available options is large and the interactions between options are complex. Moreover, the different data streams accessed by analytics workloads have distinct characteristics that may be better served by different types of storage devices. We present Selecta, a tool that recommends near-optimal configurations of cloud compute and storage resources for data analytics workloads. Selecta uses latent factor collaborative filtering to predict how an application will perform across different configurations, based on sparse data collected by profiling training workloads. We evaluate Selecta with over one hundred Spark SQL and ML applications, showing that Selecta chooses a near-optimal performance configuration (within 10% of optimal) with 94% probability and a near-optimal cost configuration with 80% probability. We also use Selecta to draw significant insights about cloud storage systems, including the performance-cost efficiency of NVMe Flash devices, the need for cloud storage with support for fine-grain capacity and bandwidth allocation, and the motivation for end-to-end storage optimizations.
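To make the idea of latent factor collaborative filtering on sparse profiling data concrete, the following is a minimal sketch, not Selecta's actual implementation. It assumes a plain matrix factorization with a global mean, fitted by stochastic gradient descent on the observed entries only. The function names (`fit_latent_factors`, `predict_all`), the synthetic runtimes, and all hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_latent_factors(observed, n_rows, n_cols, rank=2, lr=0.02,
                       reg=0.01, epochs=2000, seed=0):
    """Fit R ~= mu + P @ Q.T using only the observed (row, col, value) triples."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_rows, rank))  # per-workload latent factors
    Q = 0.1 * rng.standard_normal((n_cols, rank))  # per-configuration latent factors
    mu = float(np.mean([v for _, _, v in observed]))  # global mean of observations
    for _ in range(epochs):
        for idx in rng.permutation(len(observed)):
            i, j, v = observed[idx]
            err = v - (mu + P[i] @ Q[j])
            p_old = P[i].copy()
            # Gradient step on squared error with L2 regularization.
            P[i] += lr * (err * Q[j] - reg * P[i])
            Q[j] += lr * (err * p_old - reg * Q[j])
    return mu, P, Q

def predict_all(mu, P, Q):
    """Predicted value for every (workload, configuration) pair."""
    return mu + P @ Q.T

# Synthetic example (assumed data): 4 workloads x 5 configurations.
# Rows 0-2 play the role of training workloads; row 3 is the target
# application, profiled on only two configurations.
true_a = np.array([1.0, 1.5, 2.0, 0.5])
true_b = np.array([1.0, 0.8, 0.5, 0.4, 0.3])
full = np.outer(true_a, true_b)  # ground-truth normalized runtimes

observed = [(i, j, float(full[i, j]))
            for i in range(3) for j in range(5) if (i + j) % 4 != 0]
observed += [(3, 0, float(full[3, 0])), (3, 2, float(full[3, 2]))]

mu, P, Q = fit_latent_factors(observed, n_rows=4, n_cols=5)
pred = predict_all(mu, P, Q)

# Recommend the configuration with the lowest predicted runtime for the target.
best_config = int(np.argmin(pred[3]))
print("predicted runtimes for target workload:", np.round(pred[3], 3))
print("recommended configuration index:", best_config)
```

In this sketch the target workload's missing entries are filled in from structure learned on the training workloads, and the recommendation is simply the configuration with the lowest predicted runtime. A cost-oriented recommendation could weight each prediction by a per-configuration price before taking the minimum.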
