Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently
Tianyin Xu | Kaushik Veeraraghavan | Justin Meza | Sankaralingam Panneerselvam | Alex Gyori | Yee Jiun Song | Daniel Obenshain | Ashish Shah | Shruti Padmanabha | David Chou | Sonia Margulis | Scott Michelson
[1] Tanakorn Leesatapornwongsa,et al. What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems , 2014, SoCC.
[2] Joseph M. Hellerstein,et al. Lineage-driven Fault Injection , 2015, SIGMOD Conference.
[3] Onur Mutlu,et al. A Large Scale Study of Data Center Network Reliability , 2018, Internet Measurement Conference.
[4] Noah Treuhaft,et al. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies , 2002.
[5] Thomas F. Wenisch,et al. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services , 2014, OSDI.
[6] Ítalo S. Cunha,et al. Engineering Egress with Edge Fabric: Steering Oceans of Content to the World , 2017, SIGCOMM.
[7] Andreas Haeberlen,et al. One Primitive to Diagnose Them All: Architectural Support for Internet Diagnostics , 2017, EuroSys.
[8] Zuoning Yin,et al. How Do Fixes Become Bugs? A Comprehensive Characteristic Study on Incorrect Fixes in Commercial and Open Source Operating Systems , 2011.
[9] Thomas A. Limoncelli,et al. Resilience Engineering: Learning to Embrace Failure , 2012, ACM Queue.
[10] Costin Raiciu,et al. Stateless Datacenter Load-balancing with Beamer , 2018, NSDI.
[11] Kaushik Veeraraghavan,et al. Canopy: An End-to-End Performance Tracing And Analysis System , 2017, SOSP.
[12] Andrea C. Arpaci-Dusseau,et al. FATE and DESTINI: A Framework for Cloud Recovery Testing , 2011, NSDI.
[13] David A. Patterson,et al. Undo for Operators: Building an Undoable E-mail Store , 2003, USENIX Annual Technical Conference, General Track.
[14] Yang Liu,et al. Be conservative: enhancing failure diagnosis with proactive logging , 2012, OSDI.
[15] Xuezheng Liu,et al. D3S: Debugging Deployed Distributed Systems , 2008, NSDI.
[16] Niall Murphy,et al. Site Reliability Engineering: How Google Runs Production Systems , 2016.
[17] Kok-Kiong Yap,et al. Taking the Edge off with Espresso: Scale, Reliability and Programmability for Global Internet Peering , 2017, SIGCOMM.
[18] Ariel Tseitlin. The Antifragile Organization , 2013, ACM Queue.
[19] Ashish Gupta,et al. High-Availability at Massive Scale: Building Google's Data Infrastructure for Ads , 2015, BIRTE.
[20] Dirk Beyer,et al. Designing for Disasters , 2004, FAST.
[21] Wonho Kim,et al. Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services , 2016, OSDI.
[22] Yuanyuan Zhou,et al. Early Detection of Configuration Errors to Reduce Failure Damage , 2016, USENIX Annual Technical Conference.
[23] Luiz André Barroso,et al. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines , 2009.
[24] Peter Alvaro,et al. Abstracting the Geniuses Away from Failure Testing , 2017, ACM Queue.
[25] Raul Landa,et al. Balancing on the Edge: Transport Affinity without Network State , 2018, NSDI.
[26] Abhishek Verma,et al. Large-scale cluster management at Google with Borg , 2015, EuroSys.
[27] Kripa Krishnan. Weathering the Unexpected , 2012, ACM Queue.
[28] Michael Kehoe,et al. TrafficShift: Avoiding Disasters at Scale , 2017.
[29] Ding Yuan,et al. Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach , 2017, SOSP.
[30] Van-Anh Truong,et al. Availability in Globally Distributed Storage Systems , 2010, OSDI.
[31] Peter Alvaro,et al. Automating Failure Testing Research at Internet Scale , 2016, SoCC.
[32] Carlo Contavalli,et al. Maglev: A Fast and Reliable Software Network Load Balancer , 2016, NSDI.
[33] Yuanyuan Zhou,et al. Do not blame users for misconfigurations , 2013, SOSP.
[34] Yu Luo,et al. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems , 2014, OSDI.
[35] Jeffrey C. Mogul,et al. Thinking about Availability in Large Service Infrastructures , 2017, HotOS.
[36] Michael Dahlin,et al. The calculus of service availability , 2017, CACM.
[37] Robert Karl,et al. Holistic configuration management at Facebook , 2015, SOSP.
[38] Joel Wein,et al. ACMS: the Akamai configuration management system , 2005, NSDI.
[39] Daniel M. Roy,et al. Enhancing Server Availability and Security Through Failure-Oblivious Computing , 2004, OSDI.
[40] John Allspaw,et al. Fault Injection in Production: Making the case for resilience testing , 2012.
[41] Jon Howell,et al. Slicer: Auto-Sharding for Datacenter Applications , 2016, OSDI.
[42] James R. Larus,et al. Orleans: cloud computing for everyone , 2011, SoCC.
[43] Haryadi S. Gunawi,et al. Why Does the Cloud Stop Computing?: Lessons from Hundreds of Service Outages , 2016, SoCC.
[44] Archana Ganapathi,et al. Why Do Internet Services Fail, and What Can Be Done About It? , 2002, USENIX Symposium on Internet Technologies and Systems.
[45] Michael Abd-El-Malek,et al. Omega: flexible, scalable schedulers for large compute clusters , 2013, EuroSys.
[46] George Candea,et al. Microreboot - A Technique for Cheap Recovery , 2004, OSDI.
[47] Ramesh Govindan,et al. Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure , 2016, SIGCOMM.
[48] Yuanyuan Zhou,et al. Rx: treating bugs as allergies - a safe method to survive software failures , 2005, SOSP.
[49] Tanakorn Leesatapornwongsa,et al. The Case for Drill-Ready Cloud Computing , 2014, SoCC.
[50] Qi Huang,et al. Gorilla: A Fast, Scalable, In-Memory Time Series Database , 2015, Proc. VLDB Endow.
[51] Robert B. Ross,et al. Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems , 2018, FAST.