Partition-Tolerant Distributed Publish/Subscribe Systems

In this paper, we develop reliable distributed publish/subscribe algorithms that tolerate the concurrent failure of up to d brokers or communication links. Here, d is a configuration parameter that determines the system's level of fault tolerance, and reliability means exactly-once, per-source in-order delivery of publications to clients with matching subscriptions. We propose protocols that address three problems in the presence of broker or link failures: (i) subscription propagation, (ii) publication forwarding, and (iii) broker recovery. Finally, we study the effectiveness of our approach when the number of concurrent failures exceeds d. Through large-scale experimental evaluations with up to 500 brokers, we demonstrate that a system configured with a modest value of d = 3 reliably delivers 97% of publications when up to 17% of its brokers have failed.
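The reliability property defined above (exactly-once, per-source in-order delivery) can be illustrated with a minimal receiver-side sketch. It assumes that each publisher stamps its publications with consecutive sequence numbers starting at 0. That numbering scheme and the names `InOrderDeliverer` and `receive` are hypothetical, introduced only for illustration; they are not the protocols proposed in this paper.

```python
from collections import defaultdict


class InOrderDeliverer:
    """Releases each source's publications in sequence order, each exactly once.

    Assumption (not from the paper): every source numbers its publications
    0, 1, 2, ... with no gaps.
    """

    def __init__(self):
        # Next sequence number expected from each source.
        self.next_seq = defaultdict(int)
        # Publications that arrived ahead of their turn, keyed by source.
        self.pending = defaultdict(dict)

    def receive(self, source, seq, payload):
        """Accept one publication and return those that become deliverable."""
        # Drop duplicates: already delivered, or already buffered.
        if seq < self.next_seq[source] or seq in self.pending[source]:
            return []
        self.pending[source][seq] = payload

        # Release the longest gap-free run starting at the expected number.
        released = []
        while self.next_seq[source] in self.pending[source]:
            current = self.next_seq[source]
            released.append((source, current, self.pending[source].pop(current)))
            self.next_seq[source] += 1
        return released


deliverer = InOrderDeliverer()
early = deliverer.receive("p1", 1, "second")     # buffered until 0 arrives
in_order = deliverer.receive("p1", 0, "first")   # releases 0, then 1
duplicate = deliverer.receive("p1", 0, "first")  # ignored as a duplicate
```

Ordering is tracked per source only, so publications from different sources may interleave freely, which matches the per-source scope of the guarantee.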
