Load-Optimal Local Fast Rerouting for Resilient Networks

Reliable and highly available computer networks must implement resilient fast rerouting mechanisms: upon a link or node failure, an alternative route is determined quickly, without involving the network control plane. Designing such fast failover mechanisms so that they can deal with multiple concurrent failures is challenging, however, as failover rules need to be installed proactively, i.e., ahead of time, without knowledge of the actual failures happening at runtime. Indeed, little is known today about the design of resilient routing algorithms. This paper presents a deterministic local failover mechanism which we prove to result in a minimum network load for a wide range of communication patterns, solving an open problem. Our mechanism relies on the key insight that resilient routing essentially constitutes a distributed algorithm without coordination. Accordingly, we build upon the theory of combinatorial designs and develop a novel deterministic failover mechanism based on symmetric block design theory which, in the worst case, tolerates a maximal number of Ω(n) link failures in an n-node network while always ensuring routing connectivity. In particular, we show that at least Ω(ϕ²) link failures are needed to generate a maximum link load of at least ϕ, which matches an existing bound on the number of link failures needed for an optimal failover scheme. We complement our formal analysis with simulations, showing that our approach outperforms prior schemes not only in the worst case.
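To make the notion of a symmetric block design concrete, the following minimal Python sketch builds the smallest nontrivial example, the (7, 3, 1) design known as the Fano plane, and checks its defining properties. It is an illustration only and not the failover construction of the paper: the function names `cyclic_design`, `is_symmetric_design`, and `pairwise_intersections` are hypothetical, and the base block {0, 1, 3} modulo 7 is a standard textbook difference set chosen for this example.

```python
from itertools import combinations


def cyclic_design(base_block, modulus):
    """Develop a base block cyclically: block i is {(b + i) mod modulus}."""
    return [
        frozenset((b + i) % modulus for b in base_block)
        for i in range(modulus)
    ]


def is_symmetric_design(blocks, v, k, lam):
    """Check the (v, k, lam) symmetric design conditions on points 0..v-1.

    Symmetric means there are as many blocks as points. Every block must
    have size k, and every pair of distinct points must occur together
    in exactly lam blocks.
    """
    if len(blocks) != v:
        return False
    if any(len(block) != k for block in blocks):
        return False
    for x, y in combinations(range(v), 2):
        shared = sum(1 for block in blocks if x in block and y in block)
        if shared != lam:
            return False
    return True


def pairwise_intersections(blocks):
    """Return the set of intersection sizes over all pairs of distinct blocks."""
    return {len(a & b) for a, b in combinations(blocks, 2)}


# Illustrative instance: the Fano plane from the difference set {0, 1, 3} mod 7.
blocks = cyclic_design({0, 1, 3}, 7)

if __name__ == "__main__":
    for i, block in enumerate(blocks):
        print(i, sorted(block))
    print("symmetric (7, 3, 1) design:", is_symmetric_design(blocks, 7, 3, 1))
    print("block intersection sizes:", pairwise_intersections(blocks))
```

In a symmetric design, any two distinct blocks share exactly λ points (here, one). A bounded overlap of this kind is the sort of property one would want when different nodes pick failover alternatives without coordination, although how the paper maps designs onto failover rules is not specified in this abstract.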
