FaultSee: Reproducible Fault Injection in Distributed Systems

Distributed systems are increasingly important in modern society, often operating on a global scale with stringent dependability requirements. Despite the vast amount of research and the development of techniques to build dependable systems, faults are inevitable, as the regular failures of major IT service providers attest. It is therefore fundamental to evaluate distributed systems under different fault patterns and adversarial conditions, to assess their high-level behaviour and minimize the occurrence of failures. However, succinctly capturing the system configuration, environment, fault patterns, and other variables affecting an experiment is very hard, leading to a reproducibility crisis. In this paper we propose the FaultSee toolkit. FaultSee has two components: (1) FDSL, a simple and descriptive language that captures the system, environment, workload, and fault-pattern characteristics of an experiment; and (2) an easy-to-use platform that deploys and runs the experiments described in the language. FaultSee makes it possible to precisely describe and reproduce experiments, leading to a better assessment of the impact of faults on distributed systems. We showcase the key features of FaultSee by studying the impact of faults on real deployments of Apache Cassandra and BFT-Smart.
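To make the idea of a declarative experiment description concrete, the sketch below shows how a FaultSee-style experiment might be expressed programmatically. This is a minimal illustration in Python, not the actual FDSL syntax: the Experiment, Service, and FaultEvent names, their fields, and the "kill" action are all hypothetical, invented here only to convey the kind of information (system, environment, workload, and fault schedule) the abstract says FDSL captures.

```python
from dataclasses import dataclass, field
from typing import List
import json

# NOTE: every name and field below is a hypothetical illustration of the
# kind of information an FDSL-style experiment description might capture;
# this is not the actual FaultSee/FDSL syntax.

@dataclass
class Service:
    name: str      # service identifier
    image: str     # container image to deploy
    replicas: int  # initial cluster size

@dataclass
class FaultEvent:
    at_seconds: int  # when the fault fires, relative to experiment start
    target: str      # which service the fault applies to
    action: str      # illustrative action name, e.g. "kill"
    count: int = 1   # how many replicas are affected

@dataclass
class Experiment:
    name: str
    duration_seconds: int
    services: List[Service] = field(default_factory=list)
    faults: List[FaultEvent] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the description so a runner could replay it verbatim."""
        return json.dumps(self, default=lambda o: o.__dict__, indent=2)

# A Cassandra-style scenario: run 5 replicas for 10 minutes and kill two
# of them halfway through to observe the impact on client-visible behaviour.
experiment = Experiment(
    name="cassandra-crash-faults",
    duration_seconds=600,
    services=[Service(name="cassandra", image="cassandra:3.11", replicas=5)],
    faults=[FaultEvent(at_seconds=300, target="cassandra",
                       action="kill", count=2)],
)

if __name__ == "__main__":
    print(experiment.to_json())
```

Because the whole scenario, including the fault schedule, lives in one serializable artifact, another researcher can re-run the same deployment and fault pattern without guessing at unstated configuration, which is exactly the reproducibility gap the toolkit targets.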
