FATE and DESTINI: A Framework for Cloud Recovery Testing

As the cloud era begins and failures become commonplace, failure recovery becomes a critical factor in the availability, reliability, and performance of cloud services. Unfortunately, recovery problems still occur, causing downtime, data loss, and many other problems. We propose a new testing framework for cloud recovery: FATE (Failure Testing Service) and DESTINI (Declarative Testing Specifications). With FATE, recovery is systematically tested in the face of multiple failures. With DESTINI, correct recovery is specified clearly, concisely, and precisely. We have integrated our framework into several cloud systems (e.g., HDFS [33]), explored over 40,000 failure scenarios, written 74 specifications, found 16 new bugs, and reproduced 51 old bugs.
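To make the two ideas concrete, the following is a minimal sketch and not the FATE or DESTINI implementation. It assumes a toy five-node cluster with a hypothetical, deliberately limited recovery routine, enumerates every combination of up to two node crashes (the "multiple failures" idea), and checks each resulting state against one declaratively stated rule (the "specification" idea). All names here (`NODES`, `run_scenario`, `violates_spec`, `explore`) are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical toy cluster; none of these names come from FATE or DESTINI.
NODES = ["n1", "n2", "n3", "n4", "n5"]
REPLICATION = 3


def run_scenario(crashed):
    """Write one block, crash the given nodes, then run a toy recovery."""
    holders = set(NODES[:REPLICATION])           # initial replica placement
    live = [n for n in NODES if n not in crashed]
    holders &= set(live)                         # replicas on crashed nodes are lost
    # Deliberately limited recovery: it re-creates at most ONE lost replica.
    spare = [n for n in live if n not in holders]
    if holders and spare and len(holders) < REPLICATION:
        holders.add(spare[0])
    return {"live": live, "holders": holders}


def violates_spec(state):
    """Rule: after recovery, the block has as many replicas as the live
    nodes allow, up to the replication factor."""
    expected = min(REPLICATION, len(state["live"]))
    return len(state["holders"]) != expected


def explore(max_failures):
    """Enumerate every crash combination up to max_failures and check the rule."""
    scenarios, failing = [], []
    for k in range(max_failures + 1):
        for crashed in combinations(NODES, k):
            scenarios.append(crashed)
            if violates_spec(run_scenario(crashed)):
                failing.append(crashed)
    return scenarios, failing


if __name__ == "__main__":
    scenarios, failing = explore(max_failures=2)
    print(f"explored {len(scenarios)} scenarios, {len(failing)} violate the spec")
    for crashed in failing:
        print("  crashed:", ", ".join(crashed))
```

In this toy setup every single-crash scenario satisfies the rule, and only scenarios that crash two replica holders expose the recovery flaw, which illustrates why exploring combinations of failures can matter.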
