Managing Tail Latency in Datacenter-Scale File Systems Under Production Constraints

Distributed file systems often exhibit high tail latencies, especially in large-scale datacenters and in the presence of competing (and possibly higher-priority) workloads. This paper introduces techniques for managing tail latencies in these systems while addressing the practical challenges inherent in production datacenters (e.g., hardware heterogeneity, interference from other workloads, and the need to maximize simplicity and maintainability). We implement our techniques in a scalable distributed file system (an extension of HDFS) used in production at Microsoft. Our evaluation uses 70k servers in 3 datacenters, and shows that our techniques reduce tail latency significantly for production workloads.
