A Highly Available Software Defined Fabric

Existing SDNs rely on a collection of intricate, mutually dependent mechanisms to implement a logically centralized control plane. These cyclical dependencies and the lack of a clean separation of concerns can impact the availability of SDNs, such that a handful of link failures could render entire portions of an SDN non-functional. This paper shows why and when this could happen, and makes the case for taking a fresh look at architecting SDNs for robustness to faults from the ground up. Our approach carefully synthesizes several key distributed systems ideas, in particular reliable flooding, global snapshots, and replicated controllers. We argue informally that it can offer high availability in the face of a variety of network failures, but much work remains to make our approach scalable and general. Thus, our paper represents a starting point for a broader discussion on approaches for building highly available SDNs.
