Tolerating SDN application failures with LegoSDN

Despite the proven benefits of Software-Defined Networking (SDN), there remains significant reluctance to adopt it. Among the issues that hamper SDN's adoption, two stand out: reliability and fault tolerance. At the heart of these issues is a set of fate-sharing relationships. The first is between the SDN-Apps and the controller, wherein a crash of the former induces a crash of the latter, thereby affecting controller availability. The second is between the SDN-App and the network, wherein a Byzantine failure of the SDN-App (e.g., one that creates black holes or network loops) induces a failure in the network, thereby affecting network availability. The principal position of this paper is that availability is of utmost concern, second only to security. To this end, we present a redesign of the controller architecture centered on a set of abstractions that eliminate these fate-sharing relationships and make the controllers and the network resilient to SDN-App failures. We illustrate how these abstractions can be used to improve the reliability of an SDN environment, thus eliminating one of the barriers to SDN's adoption.
