Supporting Strong Reliability for Distributed Complex Event Processing Systems

Many classes of applications, such as monitoring applications, process massive amounts of data from a potentially huge number of data sources. Complex Event Processing (CEP) has evolved as the paradigm of choice for detecting meaningful situations (complex events) by performing stepwise correlation over event streams. To keep up with the scalability demands of growing input streams, recent approaches distribute event correlation over several correlation nodes. However, the failure of even a single correlation node affects the correctness of the final correlation result. In this paper, we illustrate the importance of strong reliability semantics for CEP in the context of a monitoring application in a distributed production environment. Strong reliability ensures that each complex event is detected and delivered exactly once to each application entity; it cannot be guaranteed by the naive application of established replication principles. We present a replication scheme that ensures strong reliability in an asynchronous system model and can be applied to an arbitrary distributed CEP system. The algorithm tolerates f simultaneous failures by introducing f additional replicas for each correlation node. We prove its correctness and evaluate the overhead it introduces. The results show that the overhead scales linearly with the number of deployed replicas and the node failure rate.
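
To make the exactly-once guarantee with f additional replicas concrete, the following is a minimal Python sketch. It is not the replication scheme presented in the paper. It only illustrates the delivery property under simplifying assumptions: replicas are deterministic, they emit identical complex events carrying the same identifier, and failures are crashes. The names `ComplexEvent`, `ExactlyOnceConsumer`, and `run_replicas` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ComplexEvent:
    # Assumption: every replica of a correlation node assigns the same
    # identifier to the same detected complex event.
    event_id: str
    payload: str


@dataclass
class ExactlyOnceConsumer:
    """Application entity that delivers each complex event at most once."""
    delivered: list = field(default_factory=list)
    _seen: set = field(default_factory=set)

    def receive(self, event: ComplexEvent) -> bool:
        # Drop copies that another replica has already supplied.
        if event.event_id in self._seen:
            return False
        self._seen.add(event.event_id)
        self.delivered.append(event)
        return True


def run_replicas(events, f, failed):
    """Send each event from every live replica (out of f + 1) to one consumer.

    `failed` is the set of crashed replica indices; crashed replicas send nothing.
    """
    consumer = ExactlyOnceConsumer()
    live = [r for r in range(f + 1) if r not in failed]
    for event in events:
        for _ in live:
            consumer.receive(event)
    return consumer.delivered
```

With f = 2, the consumer in this sketch still receives every event exactly once when any two of the three replicas crash, and it receives nothing only when all f + 1 replicas fail. The sketch ignores what makes the problem hard in an asynchronous system, such as keeping replicas consistent when events arrive in different orders.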
