Netco: Cache and I/O Management for Analytics over Disaggregated Stores

We consider a common setting in data-parallel systems where storage is disaggregated from compute. Colocating caching tiers with the compute machines can reduce load on the interconnect, but doing so leads to new resource management challenges. We design Netco, a system that prefetches data into the cache (based on workload predictability) and appropriately divides the cache space and network bandwidth between prefetches and serving ongoing jobs. Netco decides what content to cache, when to cache it, and how to apportion bandwidth in order to support end-to-end optimization goals such as maximizing the number of jobs that meet their service-level objectives (e.g., deadlines). Our implementation of these ideas is available within the open-source Apache HDFS project. Experiments on a public cloud, with production-trace-inspired workloads, show that Netco uses up to 5x less remote I/O than existing techniques and increases the number of jobs that meet their deadlines by up to 80%.
