PREFAIL: a programmable tool for multiple-failure injection

As hardware failures are no longer rare in the era of cloud computing, cloud software systems must "prevail" against the multiple, diverse failures that are likely to occur. Testing software against multiple failures runs into a combinatorial explosion: the number of possible failure combinations grows too quickly to test exhaustively. To address this problem, we present PreFail, a programmable failure-injection tool that enables testers to write a wide range of policies to prune the large space of multiple failures. We integrate PreFail into three cloud software systems (HDFS, Cassandra, and ZooKeeper), show a wide variety of useful pruning policies that we can write for them, and evaluate the speed-ups in testing time that we obtain by using the policies. In our experiments, our testing approach with appropriate policies found all the bugs that exhaustive testing can find while spending 10X to 200X less time.
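To make the idea of a pruning policy concrete, the following is a minimal Python sketch of the combinatorial explosion and of one way a policy might shrink it. It is an illustration only and does not use PreFail's actual API: the failure points, the function names (`all_failure_combinations`, `prune_by_op_signature`), and the policy itself are hypothetical. The assumed policy treats nodes as interchangeable and keeps one representative combination per set of failed operation types.

```python
from itertools import combinations

# Hypothetical failure points: (node, I/O operation) pairs.
# This is an assumed toy model, not PreFail's real data model.
NODES = ("n1", "n2", "n3")
OPS = ("read", "write", "sync", "connect")
FAILURE_POINTS = [(node, op) for node in NODES for op in OPS]


def all_failure_combinations(points, k):
    """Exhaustive space: every choice of k distinct failure points."""
    return list(combinations(points, k))


def prune_by_op_signature(combos):
    """Assumed policy: nodes are symmetric, so keep only the first
    combination seen for each multiset of failed operation types."""
    seen = set()
    kept = []
    for combo in combos:
        signature = tuple(sorted(op for _, op in combo))
        if signature not in seen:
            seen.add(signature)
            kept.append(combo)
    return kept


exhaustive = all_failure_combinations(FAILURE_POINTS, 2)
pruned = prune_by_op_signature(exhaustive)
print(len(exhaustive), len(pruned))
```

In this toy setup, two simultaneous failures over 12 failure points give 66 combinations, and the assumed policy reduces them to 10. Whether such a reduction preserves bug-finding power depends on the system under test, which is why the policy is left to the tester to write.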
