SETSUDŌ: perturbation-based testing framework for scalable distributed systems

Modern scalable distributed systems are designed to be partition-tolerant. They are often required to handle growing request load elastically and to provide uninterrupted service even when some servers malfunction. Partition tolerance enables such systems to withstand arbitrary loss of messages as "perceived" by the communicating nodes. In practice, however, partition tolerance and robustness are not tested rigorously. Severe system-level design defects often stay hidden even after deployment, possibly resulting in loss of revenue or customer satisfaction. We propose a novel, rigorous perturbation-based testing framework, named SETSUDŌ, targeted at exposing system-level defects in scalable distributed systems. It applies perturbations (i.e., controlled changes) from the environment of a system during testing, and leverages awareness of system-internal states to control their timing precisely. It uses a flexible instrumentation framework to select relevant internal states and to implement the system-level code for perturbations. It also provides a test policy language framework, in which high-level sequences of perturbation scenarios are converted automatically to system-level test code. This test code is woven automatically into the application code during testing, and any observed defects are reported. We have implemented our perturbation testing framework and evaluated it on several open source projects, where it exposed known defects as well as some previously unknown ones. Our framework leverages small-scale testing and avoids the upfront infrastructure costs typically needed for large-scale stress testing.
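The idea of timing a perturbation against a system-internal state can be illustrated with a minimal Python sketch. This is not SETSUDŌ's implementation or policy language: the names `Network`, `Rule`, `Harness`, `replicate`, and `run_scenario`, the state names, and the toy leader/follower protocol are all hypothetical, chosen only to show a rule of the form "when a node enters a state, perturb the environment" and a check that reports the resulting defect.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Network:
    """Toy in-memory network; `blocked` holds (src, dst) links that drop messages."""
    blocked: Set[Tuple[str, str]] = field(default_factory=set)
    inboxes: Dict[str, List[str]] = field(default_factory=dict)

    def send(self, src: str, dst: str, msg: str) -> bool:
        if (src, dst) in self.blocked:
            return False  # message is lost, as perceived by the receiver
        self.inboxes.setdefault(dst, []).append(msg)
        return True


@dataclass
class Rule:
    """Hypothetical policy rule: when `node` enters `state`, apply `action`."""
    node: str
    state: str
    action: Callable[[Network], None]


class Harness:
    """Receives internal-state notifications and fires matching perturbations."""

    def __init__(self, net: Network, rules: List[Rule]) -> None:
        self.net = net
        self.rules = list(rules)
        self.fired: List[str] = []

    def on_state(self, node: str, state: str) -> None:
        for rule in self.rules:
            if rule.node == node and rule.state == state:
                rule.action(self.net)
                self.fired.append(f"{node}:{state}")


def replicate(net: Network, harness: Harness, value: str, followers: List[str]) -> None:
    """Toy leader: prepare, then commit, with no retry (the planted defect)."""
    harness.on_state("leader", "PREPARING")  # stands in for an instrumented hook
    for f in followers:
        net.send("leader", f, f"prepare {value}")
    harness.on_state("leader", "COMMITTING")
    for f in followers:
        net.send("leader", f, f"commit {value}")


def run_scenario() -> Tuple[List[str], List[str]]:
    net = Network()
    # Perturbation: cut the leader -> B link exactly when the commit phase starts.
    rules = [Rule("leader", "COMMITTING", lambda n: n.blocked.add(("leader", "B")))]
    harness = Harness(net, rules)
    followers = ["A", "B"]
    replicate(net, harness, "x=1", followers)
    # Invariant: a follower that saw the prepare must also see the commit.
    violations = [f for f in followers
                  if "prepare x=1" in net.inboxes.get(f, [])
                  and "commit x=1" not in net.inboxes.get(f, [])]
    return harness.fired, violations


if __name__ == "__main__":
    fired, violations = run_scenario()
    print("perturbations fired:", fired)
    print("followers violating the invariant:", violations)
```

In this sketch the explicit `on_state` calls play the role that automatically woven instrumentation would play in a real system, and the single `Rule` stands in for one step of a higher-level perturbation scenario. Because the link is cut only after the prepare phase completes, the run leaves follower B with a prepare but no commit, which is the kind of timing-dependent defect that an untimed, random message drop could easily miss.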
