Responding to Network Failures at Data-plane Speeds with Network Programmability

Measurement studies show that equipment failures are frequent and pose a challenge to reliable network operation. Recovering quickly from failures is critical to meeting service guarantees. Traditional routing protocols run in a distributed fashion across multiple devices, so they need non-negligible time to recompute routes after a failure. SDN with OpenFlow simplifies route recomputation, but the time to compute and install alternative forwarding entries can still cause significant packet loss. Existing fast failover mechanisms cannot handle all types of failures and do not guarantee use of the best paths. In this paper, we present FELIX, an approach to failure recovery that reroutes around failures at data-plane timescales. FELIX efficiently pre-computes tactics for handling failure scenarios; these tactics can be quickly activated in the data plane when a failure occurs. Our evaluation shows that FELIX recovers from failures up to three orders of magnitude faster than existing SDN approaches.
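The general idea of pre-computing responses to failure scenarios and activating them with a lookup can be illustrated with a minimal Python sketch. This is not FELIX's algorithm, which the abstract does not detail. It assumes single-link failures only, hop-count shortest paths, and a hypothetical four-node ring topology. All names (`next_hops`, `precompute`, `forward`, `TOPOLOGY`, `TABLES`) are invented for illustration.

```python
from collections import deque

def next_hops(adj, dst, failed=frozenset()):
    """BFS from dst, skipping failed links; returns {node: next hop toward dst}."""
    hops = {}
    seen = {dst}
    queue = deque([dst])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v in seen or frozenset((u, v)) in failed:
                continue
            seen.add(v)
            hops[v] = u  # v reaches dst by forwarding to u
            queue.append(v)
    return hops

def precompute(adj):
    """Build one forwarding table per scenario: no failure, plus each single link failure."""
    links = {frozenset((u, v)) for u in adj for v in adj[u]}
    scenarios = [frozenset()] + [frozenset([link]) for link in sorted(links, key=sorted)]
    return {s: {d: next_hops(adj, d, s) for d in adj} for s in scenarios}

def forward(tables, scenario, node, dst):
    """Activation is a lookup keyed by the detected scenario; nothing is recomputed."""
    return tables[scenario][dst].get(node)

# Hypothetical ring topology A-B-C-D-A (an assumption for this example).
TOPOLOGY = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["A", "C"]}
TABLES = precompute(TOPOLOGY)

NO_FAILURE = frozenset()
LINK_AB_DOWN = frozenset([frozenset(("A", "B"))])

print(forward(TABLES, NO_FAILURE, "A", "C"))    # next hop in the failure-free case
print(forward(TABLES, LINK_AB_DOWN, "A", "C"))  # next hop once link A-B has failed
```

In this sketch the expensive work (one BFS per destination per scenario) happens ahead of time, and reacting to a failure reduces to selecting a different pre-built table. A real data-plane design would also have to bound table size and cover multi-failure scenarios, which this example does not attempt.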
