Aaron: An adaptable execution environment

Software bugs and hardware errors are the largest contributors to downtime [1], and can be permanent (e.g. deterministic memory violations, broken memory modules) or transient (e.g. race conditions, bitflips). Although a large variety of dependability mechanisms exist, only few are used in practice. The existing techniques do not prevail for several reasons: (1) the introduced performance overhead is often not negligible, (2) the gained coverage is not sufficient, and (3) users cannot control and adapt the mechanism. Aaron tackles these challenges by detecting hardware and software errors using automatically diversified software components. It uses these software variants only if CPU spare cycles are present in the system. In this way, Aaron increases fault coverage without incurring a perceivable performance penalty. Our evaluation shows that Aaron provides the same throughput as an execution of the original application while checking a large percentage of requests — whenever load permits.

[1]  Xingshe Zhou,et al.  An Adaptive Performance Management Method for Failure Detection , 2008, 2008 Ninth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing.

[2]  Christof Fetzer,et al.  AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware , 2009, SAFECOMP.

[3]  Brian Randell,et al.  System structure for software fault tolerance , 1975, IEEE Transactions on Software Engineering.

[4]  David Brumley,et al.  SplitScreen: Enabling efficient, distributed malware detection , 2010, Journal of Communications and Networks.

[5]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[6]  Manuel Costa,et al.  Bouncer: securing software by blocking bad input , 2008, WRAITS '08.

[7]  David I. August,et al.  Automatic Instruction-Level Software-Only Recovery , 2006, IEEE Micro.

[8]  Jason Flinn,et al.  Parallelizing security checks on commodity hardware , 2008, ASPLOS.

[9]  Paul H. J. Kelly,et al.  Backwards-Compatible Bounds Checking for Arrays and Pointers in C Programs , 1997, AADEBUG.

[10]  Michael Dahlin,et al.  Making Byzantine Fault Tolerant Systems Tolerate Byzantine Faults , 2009, NSDI.

[11]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[12]  Cheng Wang,et al.  Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[13]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[14]  Mahadev Konar,et al.  ZooKeeper: Wait-free Coordination for Internet-scale Systems , 2010, USENIX ATC.

[15]  Luiz André Barroso,et al.  The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines , 2009, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.

[16]  Michael K. Reiter,et al.  Fault-scalable Byzantine fault-tolerant services , 2005, SOSP '05.

[17]  Emery D. Berger,et al.  Exterminator: automatically correcting memory errors with high probability , 2007, PLDI '07.

[18]  Ion Stoica,et al.  Friday: Global Comprehension for Distributed Replay , 2007, NSDI.

[19]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[20]  Michael Stumm,et al.  Software error early detection system based on run-time statistical analysis of function return values , 2006 .

[21]  Wei Lin,et al.  WiDS Checker: Combating Bugs in Distributed Systems , 2007, NSDI.

[22]  Xuezheng Liu,et al.  D3S: Debugging Deployed Distributed Systems , 2008, NSDI.

[23]  George C. Necula,et al.  SafeDrive: safe and recoverable extensions using language-based techniques , 2006, OSDI '06.

[24]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.

[25]  Albert G. Greenberg,et al.  WebProphet: Automating Performance Prediction for Web Services , 2010, NSDI.

[26]  Daniel M. Roy,et al.  Enhancing Server Availability and Security Through Failure-Oblivious Computing , 2004, OSDI.

[27]  Mark Stephenson,et al.  Statistically regulating program behavior via mainstream computing , 2010, CGO '10.

[28]  George Candea,et al.  Microreboot - A Technique for Cheap Recovery , 2004, OSDI.

[29]  Miguel Castro,et al.  Securing software by enforcing data-flow integrity , 2006, OSDI '06.

[30]  Christof Fetzer,et al.  Prospect: a compiler framework for speculative parallelization , 2010, CGO '10.

[31]  Yuanyuan Zhou,et al.  Rx: treating bugs as allergies---a safe method to survive software failures , 2005, SOSP '05.

[32]  Martín Abadi,et al.  XFI: software guards for system address spaces , 2006, OSDI '06.

[33]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[34]  Michael Franz,et al.  Orchestra: intrusion detection using parallel execution and monitoring of program variants in user-space , 2009, EuroSys '09.

[35]  Christof Fetzer,et al.  Speculation for Parallelizing Runtime Checks , 2009, SSS.

[36]  Jeffrey S. Chase,et al.  Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control , 2004, OSDI.