Multithreading-Enabled Active Replication for Event Stream Processing Operators

Event Stream Processing (ESP) systems are widely used in monitoring applications such as algorithmic trading, network monitoring, and sensor networks. As these systems grow larger and more widely deployed, they must meet increasingly strict requirements that are often difficult to satisfy. Fault tolerance is one such non-trivial requirement: making ESP operators fault-tolerant can add considerable performance overhead to the application. In this paper, we focus on active replication as an approach to providing fault tolerance for ESP operators. More precisely, we address the performance costs of active replication for operators in distributed ESP applications. We use a speculation mechanism based on Software Transactional Memory (STM) to achieve the following goals: (i) enable replicas to make progress using optimistic delivery; (ii) enable early forwarding of speculative computation results; and (iii) enable active replication of multi-threaded operators using transactional executions. Experimental evaluation shows that this combination of mechanisms makes it possible to implement highly efficient fault-tolerant ESP operators.
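To make goals (i) and (ii) concrete, the following is a minimal single-threaded sketch of a replica that processes events as soon as they are optimistically delivered, forwards results marked as speculative, and re-executes when the final delivery order disagrees with the optimistic one. It is an illustration only, not the paper's implementation: the class `SpeculativeReplica`, the methods `opt_deliver` and `final_deliver`, and the toy operator `apply_event` are hypothetical names, events are assumed to be distinct integers, and plain re-execution stands in for the STM-based rollback. Multi-threaded transactional execution (goal iii) is not modeled.

```python
from typing import List, Tuple


def apply_event(state: int, event: int) -> int:
    """Toy order-sensitive operator (an assumption for illustration)."""
    return state * 2 + event


class SpeculativeReplica:
    """Hypothetical replica: speculate on optimistic order, commit on final order."""

    def __init__(self) -> None:
        self.committed = 0                  # state after all finally ordered events
        self.pending: List[int] = []        # events applied speculatively, in optimistic order
        self.outputs: List[Tuple[int, str]] = []  # (result, "speculative" | "final")

    def _speculative_state(self) -> int:
        # Replay pending events on top of the committed state.
        state = self.committed
        for event in self.pending:
            state = apply_event(state, event)
        return state

    def opt_deliver(self, event: int) -> None:
        # Optimistic delivery: process immediately and forward a speculative result.
        self.pending.append(event)
        self.outputs.append((self._speculative_state(), "speculative"))

    def final_deliver(self, event: int) -> None:
        # Final delivery fixes the next event in the agreed total order.
        guessed_right = bool(self.pending) and self.pending[0] == event
        if event in self.pending:
            self.pending.remove(event)
        self.committed = apply_event(self.committed, event)
        self.outputs.append((self.committed, "final"))
        if not guessed_right:
            # Mis-speculation: discard the old speculative results and
            # re-execute the remaining pending events on the corrected state.
            state = self.committed
            for pending_event in self.pending:
                state = apply_event(state, pending_event)
                self.outputs.append((state, "speculative"))


# Optimistic order (1, 2) differs from the final order (2, 1).
replica = SpeculativeReplica()
replica.opt_deliver(1)
replica.opt_deliver(2)
replica.final_deliver(2)
replica.final_deliver(1)
print(replica.outputs)
```

When the optimistic and final orders agree, every speculative result is simply confirmed by a matching final result, which is the case in which early forwarding pays off; a downstream consumer in this sketch would treat any "speculative" result as revocable until the corresponding "final" one arrives.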
