Overlay Routing under Geographically Correlated Failures in Distributed Event-Based Systems

In this paper we study the problem of delivering messages between endpoints without interruption in the presence of spatially correlated failures as well as independent failures. We developed an algorithm, independent of any particular failure model, that computes routing paths from failure correlations using a priori failure statistics together with available real-time monitoring information. The algorithm yields the most cost-efficient message routes, which may consist of multiple simultaneous paths. We also designed and implemented an Internet-based overlay routing service that allows applications to construct and maintain highly resilient end-to-end paths. We have deployed our system on a set of geographically distributed PlanetLab nodes. Our experimental results illustrate the feasibility and performance of our approach.
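
To make the idea of cost-efficient multipath routing under correlated failures concrete, the following is a minimal sketch and not the algorithm developed in the paper. It assumes a toy failure model in which each overlay node fails independently with a given probability and each geographic region is hit by an independent event that takes down all nodes in that region. The function names (`delivery_probability`, `cheapest_path_set`), the data layout, and all numbers are hypothetical, and the search is brute force, so it is only suitable for very small inputs.

```python
from itertools import combinations, product


def delivery_probability(paths, p_node, regions):
    """Probability that at least one path is fully operational.

    paths   : list of tuples of intermediate overlay nodes
    p_node  : dict node -> independent failure probability (assumed model)
    regions : list of (event_probability, set_of_nodes) correlated events
    """
    total = 0.0
    # Enumerate every combination of regional events (hit / not hit).
    for outcome in product([False, True], repeat=len(regions)):
        weight = 1.0
        down = set()
        for hit, (prob, nodes) in zip(outcome, regions):
            weight *= prob if hit else 1.0 - prob
            if hit:
                down |= nodes
        # Paths that avoid every node taken down by a regional event.
        alive = [p for p in paths if not down.intersection(p)]
        # Inclusion-exclusion over surviving paths, which may share nodes.
        p_any = 0.0
        for r in range(1, len(alive) + 1):
            for combo in combinations(alive, r):
                term = 1.0
                for node in set().union(*combo):
                    term *= 1.0 - p_node[node]
                p_any += (-1) ** (r + 1) * term
        total += weight * p_any
    return total


def cheapest_path_set(candidates, p_node, regions, target):
    """Cheapest subset of (cost, path) candidates meeting the target.

    Returns (cost, paths, probability), or None if no subset qualifies.
    """
    best = None
    for r in range(1, len(candidates) + 1):
        for combo in combinations(candidates, r):
            cost = sum(c for c, _ in combo)
            if best is not None and cost >= best[0]:
                continue  # cannot improve on the current best
            chosen = [p for _, p in combo]
            prob = delivery_probability(chosen, p_node, regions)
            if prob >= target:
                best = (cost, chosen, prob)
    return best


# Hypothetical example: relays "a" and "b" share a region, "c" does not.
p_node = {"a": 0.1, "b": 0.1, "c": 0.1}
regions = [(0.2, {"a", "b"})]
candidates = [(1, ("a",)), (1, ("b",)), (2, ("c",))]
result = cheapest_path_set(candidates, p_node, regions, target=0.95)
```

In this toy setting the pair of relays in the same region is cheaper but cannot reach the target, because a single regional event disables both paths at once; the selected set instead combines one relay from each region. This illustrates why path selection that ignores failure correlation can overestimate resilience.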
