When Software Defined Networks Meet Fault Tolerance: A Survey

Software-defined networking (SDN) is emerging as a novel network architecture that decouples the control plane from the data plane. However, SDN is unable to survive failures, particularly in large-scale data center networks. Because SDN is programmable, mechanisms can be designed to achieve fault tolerance. In this survey, we broadly discuss the fault tolerance issue and systematically review the existing methods proposed for SDN. Our presentation starts with the significant components that OpenFlow and SDN bring, which are useful for failure recovery, and then expands to fault tolerance in the data plane and the control plane, each of which requires two phases: detection and recovery. In particular, as an important part of this paper, we highlight the comparison between the two main methods for failure recovery, restoration and protection. Finally, future research issues are discussed.
