Rollback-recovery without checkpoints in distributed event processing systems

Reliability is critical to many applications built on distributed event processing systems. Stateful operators in particular make it challenging to recover efficiently from failures and to keep event streams consistent. Even during failure-free execution, state-of-the-art reliability methods incur significant run-time overhead in computational resources, event traffic, and event detection time. This paper proposes a novel rollback-recovery method that tolerates multiple simultaneous operator failures while eliminating the need for persistent checkpoints. Operator state is instead preserved in savepoints, taken at points in time when an operator's execution depends solely on the state of its incoming event streams, which predecessor operators can reproduce. We propose an expressive event processing model for determining savepoints, together with algorithms for coordinating them in a distributed operator network. Evaluations show that the method achieves very low overhead during failure-free execution compared with other approaches.
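
The savepoint idea can be illustrated with a minimal sketch. It is not the paper's model or algorithm: the names `Predecessor`, `WindowSum`, and `recover` are hypothetical, the operator is a simple count-based window sum, and the sketch assumes the savepoint (a single sequence number) survives the failure, for example because it is held by the predecessor. The point shown is only that at a window boundary the operator holds no state of its own, so recording the next input sequence number is enough to rebuild it by replaying the predecessor's stream.

```python
from dataclasses import dataclass, field


@dataclass
class Predecessor:
    """Hypothetical upstream operator that can reproduce its output stream."""
    log: list = field(default_factory=list)

    def emit(self, value):
        # Keep every emitted event so it can be replayed after a failure.
        self.log.append(value)
        return len(self.log) - 1  # sequence number of this event

    def replay(self, from_seq):
        # Reproduce (seq, value) pairs starting at the given sequence number.
        return list(enumerate(self.log))[from_seq:]


class WindowSum:
    """Hypothetical stateful operator: sums every `size` consecutive events."""

    def __init__(self, size, start_seq=0):
        self.size = size
        self.window = []
        # Savepoint: first input sequence number the operator still depends on.
        self.savepoint = start_seq

    def process(self, seq, value):
        self.window.append(value)
        if len(self.window) < self.size:
            return None
        result = sum(self.window)
        # Window closed: state is empty again, so future execution depends
        # only on incoming events from seq + 1 onward.
        self.window = []
        self.savepoint = seq + 1
        return result


def recover(pred, size, savepoint):
    """Rebuild a failed operator by replaying input from its savepoint."""
    op = WindowSum(size, start_seq=savepoint)
    outputs = []
    for seq, value in pred.replay(savepoint):
        out = op.process(seq, value)
        if out is not None:
            outputs.append(out)
    return op, outputs
```

In this sketch no operator state is ever serialized; the cost during failure-free execution is one integer update per closed window, while the predecessor bears the cost of retaining events that may need to be replayed.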
