Handling Crash and Software Faults Efficiently in Distributed Event Stream Processing

Active replication is a common approach to handling failures in distributed systems, including Event Stream Processing (ESP) systems. However, one weakness of conventional active replication is that replicas, being identical and in the same state, are susceptible to common-mode crashes caused by software bugs. We propose a new approach to active replication that assumes a failure model stronger than fail-stop but weaker than models permitting arbitrary failures. We combine transactional memory and extended runtime checking to achieve (i) low processing latency in failure-free runs, by allowing downstream nodes to use speculative results and thus circumvent the overhead added by the extended runtime checks, and (ii) a reduced MTTR, by enabling localized rollbacks (at word granularity) in several cases. We show that major limitations of n-variant active replication (e.g., limited multi-threading support, complex and slow recovery) can be overcome and that tolerance to software bugs is orthogonal to Byzantine fault tolerance.
