Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently

We present Maelstrom, a new system for mitigating and recovering from datacenter-level disasters. Maelstrom provides a traffic management framework with modular, reusable primitives that can be composed to safely and efficiently drain the traffic of interdependent services from one or more failing datacenters to healthy ones. Maelstrom ensures safety by encoding inter-service dependencies and resource constraints, and uses health monitoring to implement feedback control so that all specified constraints are satisfied by the traffic drains and recovery procedures executed during disaster mitigation. Maelstrom exploits parallelism to drain and restore independent traffic sources efficiently. We verify the correctness of Maelstrom's disaster mitigation and recovery procedures by running large-scale tests that drain production traffic from entire datacenters and then restore the traffic back to those datacenters. These tests (termed drain tests) help us gain a deep understanding of our complex systems, and provide a venue for continually improving the reliability of our infrastructure. Maelstrom has been in production at Facebook for more than four years, and has been successfully used to mitigate and recover from 100+ datacenter outages.
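To make the architecture concrete, the sketch below illustrates the general idea of composing drain steps under dependency and health constraints with parallelism across independent steps. It is a minimal illustrative sketch only: the names (DrainTask, run_runbook, health_check) and the web/cache ordering are hypothetical and are not Maelstrom's actual primitives or task language.

```python
# Illustrative sketch only. DrainTask, run_runbook, and the example tasks are
# hypothetical stand-ins for Maelstrom's runbook/task primitives, not its API.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class DrainTask:
    """One reusable primitive, e.g. shifting one service's traffic out of a datacenter."""
    name: str
    action: Callable[[], None]                              # e.g. retarget load-balancer weights
    depends_on: List[str] = field(default_factory=list)     # inter-service dependency edges
    health_check: Callable[[], bool] = lambda: True         # feedback signal gating this step

def run_runbook(tasks: Dict[str, DrainTask]) -> None:
    """Run tasks stage by stage: a task starts only after its dependencies finish and
    only while its health signal is within bounds; independent tasks run in parallel."""
    done: set = set()
    with ThreadPoolExecutor() as pool:
        while len(done) < len(tasks):
            ready = [t for t in tasks.values()
                     if t.name not in done and all(d in done for d in t.depends_on)]
            if not ready:
                raise RuntimeError("dependency cycle or unsatisfiable constraint")
            for t in ready:
                if not t.health_check():
                    raise RuntimeError(f"aborting: health constraint violated before {t.name}")
            futures = {pool.submit(t.action): t for t in ready}
            for fut, t in futures.items():
                fut.result()              # propagate failures instead of silently continuing
                done.add(t.name)

# Hypothetical ordering: drain the web tier before the cache tier it depends on,
# so no remaining web traffic faults on a cache that has already been drained.
tasks = {
    "drain_web":   DrainTask("drain_web",   action=lambda: print("draining web traffic")),
    "drain_cache": DrainTask("drain_cache", action=lambda: print("draining cache traffic"),
                             depends_on=["drain_web"]),
}
run_runbook(tasks)
```

Restoring traffic would reverse such an ordering; the actual constraint language, health metrics, and scheduler are described in the paper itself.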
